• Intel's Software Defined Super Cores

    From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Sep 15 23:54:12 2025
    From Newsgroup: comp.arch

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.
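
    As a purely software-level illustration of "chunks that can be performed
    in parallel" (my own sketch; the proposal would do the split transparently,
    below the ISA):

        /* Two independent "chunks" of a single logical thread, executed on
         * two cores and then joined.  Illustration only. */
        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000
        static double a[N], b[N];

        struct chunk { const double *v; long n; double sum; };

        static void *run_chunk(void *p)            /* one independent chunk */
        {
            struct chunk *c = p;
            double s = 0.0;
            for (long i = 0; i < c->n; i++)
                s += c->v[i];
            c->sum = s;
            return NULL;
        }

        int main(void)
        {
            for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

            struct chunk ca = { a, N, 0.0 }, cb = { b, N, 0.0 };
            pthread_t t;
            pthread_create(&t, NULL, run_chunk, &ca); /* chunk A on another core */
            run_chunk(&cb);                           /* chunk B on this core    */
            pthread_join(t, NULL);

            printf("%f\n", ca.sum + cb.sum);          /* results merged          */
            return 0;
        }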

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Sep 16 00:03:51 2025
    From Newsgroup: comp.arch

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Sep 15 17:19:36 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Two weeks ago, I saw this in Tom's Hardware.

    https://www.tomshardware.com/pc-components/cpus/intel-patents-software-defined-supercore-mimicking-ultra-wide-execution-using-multiple-cores

    But at this point, it is just a patent. While it *might* get included
    in a future product, it seems a long way away, if ever.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Sep 15 17:56:28 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    We would have to somehow tell the system that the program only uses a
    single thread, right? I'm not exactly sure how the synchronization is
    going to work with regard to multi-threaded and/or multi-process programs.

    A single-threaded program runs, then it calls into a function that
    creates a thread. Humm...


    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Can one get something kind of akin to it by clever use of affinity
    masks? But those are not 100% guaranteed, right?
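
    Affinity masks can at least pin the pieces to chosen cores, though they
    only control placement, not the tight cross-core coupling the article
    describes. A minimal sketch, assuming Linux/glibc (pthread_setaffinity_np
    and pthread_attr_setaffinity_np are GNU extensions):

        #define _GNU_SOURCE      /* for CPU_SET, the *affinity_np calls, sched_getcpu */
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>

        static void *worker(void *arg)
        {
            (void)arg;
            printf("worker on CPU %d\n", sched_getcpu());
            return NULL;
        }

        int main(void)
        {
            cpu_set_t set;
            pthread_attr_t attr;
            pthread_t t;

            /* Keep the main "chunk" on core 0. */
            CPU_ZERO(&set);
            CPU_SET(0, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            /* Start the other "chunk" already pinned to core 1. */
            CPU_ZERO(&set);
            CPU_SET(1, &set);
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&t, &attr, worker, NULL);

            pthread_join(t, NULL);
            pthread_attr_destroy(&attr);
            return 0;
        }

    The OS will keep each thread within its mask, but nothing makes the two
    cores behave as one, which is presumably why the patent adds dedicated
    inter-core connections.
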
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Sep 16 10:13:35 2025
    From Newsgroup: comp.arch

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    Sounds like [multiscalar processors](doi:multiscalar processor)


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Sep 16 10:15:04 2025
    From Newsgroup: comp.arch

    Sounds like [multiscalar processors](doi:multiscalar processor)
    ^^^^^^^^^^^^^^^^^^^^^
    10.1145/223982.224451

    [ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Sep 16 15:10:09 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:

    [ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]

    This is sooooo 2010's. Next, you'll be claiming it makes sense to
    think before writing, and where would we be then? Not in the age
    of modern social media, that's for sure.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Sep 16 15:50:38 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue Sep 16 13:01:30 2025
    From Newsgroup: comp.arch

    On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <quadibloc@invalid.invalid> wrote:

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard

    Sounds more like dynamic micro-threading.

    Over the years I've seen a handful of papers about compile-time
    micro-threading: that is, the compiler itself identifies separable
    dependency chains in serial code and rewrites them into deliberately
    threaded code to be executed simultaneously.
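
    As a toy example of the kind of rewrite those papers aim for (my own
    illustration, not taken from any of them): a serial loop that carries two
    separable dependency chains, a running sum and a running minimum, which
    never feed each other, split by hand into two threads:

        /* Original serial form:
         *   for (i = 0; i < N; i++) { s += data[i]; if (data[i] < m) m = data[i]; }
         * The two chains are independent, so they can be separated. */
        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000
        static int data[N];
        static double sum_result;
        static int    min_result;

        static void *sum_chain(void *arg)   /* dependency chain 1 */
        {
            (void)arg;
            double s = 0.0;
            for (long i = 0; i < N; i++)
                s += data[i];
            sum_result = s;
            return NULL;
        }

        static void *min_chain(void *arg)   /* dependency chain 2 */
        {
            (void)arg;
            int m = data[0];
            for (long i = 1; i < N; i++)
                if (data[i] < m)
                    m = data[i];
            min_result = m;
            return NULL;
        }

        int main(void)
        {
            for (long i = 0; i < N; i++)
                data[i] = (int)(i % 97) - 48;

            pthread_t t1, t2;
            pthread_create(&t1, NULL, sum_chain, NULL);
            pthread_create(&t2, NULL, min_chain, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            printf("sum=%f min=%d\n", sum_result, min_result);
            return 0;
        }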

    It is not easy to do under the best of circumstances and I've never
    seen anything about doing it dynamically at run time.

    To make a thread worth rehosting to another core, it would need to be
    (at least) many 10s of instructions in length. To figure this out
    dynamically at run time, it seems like you'd need the decode window to
    be 1000s of instructions and a LOT of "figure-it-out" circuitry.


    YMMV, but to me it doesn't seem worth the effort.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Sep 17 11:54:09 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
    programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

    That's what immediately came to my mind as well; it looks a lot like
    trying some of his ideas about scouting micro-threads, doing work in the
    hope that it will turn out to be useful.

    To me it sounds like it is related to eager execution, except skipping
    further forward into upcoming code.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 17 14:34:09 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 11:54:09 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    When I saw a post about a new way to do OoO, I had thought it
    might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by
    splitting programs into chunks that can be performed in parallel
    on different cores, where the cores are intimately connected in
    order to make this work.

    This is a sound idea, but one may not find enough opportunities to
    use it.

    Although it's called "inverse hyperthreading", this technique
    could be combined with SMT - put the chunks into different threads
    on the same core, rather than on different cores, and then one
    wouldn't need to add extra connections between cores to make it
    work.

    Andy Glew was working on stuff like this 10-15 years ago

    That's what immediately fell to my mind as well, it looks a lot like
    trying some of his ideas about scouting micro-threads, doing work in
    the hope that it will turn out useful.

    To me it sounds like it is related to eager execution, except
    skipping further forward into upcoming code.

    Terje



    The question is: what is the most likely meaning of the fact of patenting?
    IMHO, it means that they explored the idea and decided against going in
    this particular direction in the near- and medium-term future.

    I think that when Intel actually plans to use a particular idea, they
    keep the idea secret for as long as they can and either don't patent it at
    all or apply for a patent after the release of the product.
    I could be wrong about that.

    On the other hand, some of the people named on the patent appear to be
    leading figures in Intel's P-core teams. A year ago, some of them gave
    presentations about the advantages of removing SMT. Removal of SMT and
    this super-core idea can be considered complementary - both push in the
    direction of cores with a smaller number of EU pipes. So maybe the idea
    was seriously considered for Intel products in the mid-term future.
    Anyway, a couple of months ago Tan himself said that Intel is reversing
    the decision to remove SMT, which probably means that all their mid-term
    plans are undergoing significant changes.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Sep 17 13:46:33 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    The question is what is most likely meaning of the fact of patenting?
    IMHO, it means that they explored the idea and decided against going in
    this particular direction in the near and medium-term future.

    I think that when Intel actually plans to use particular idea then they
    keep the idea secret for as long as they can and either don't patent at
    all or apply for patent after release of the product.
    I can be wrong about it.

    That would risk that somebody without patent exchange agreements with
    Intel patents the invention first (whether independently developed or
    due to a leak). Advantages of such a strategy: Companies with patent
    exchange agreements learn even later about the invention, and the
    patent expires at a later date.

    I remember an article about alias prediction (IIRC for executing
    stores before architecturally earlier loads), where the author read a
    patent from Intel, did some measurements on a released Intel CPU,
    and confirmed that they actually implemented what the patent
    described.

    If you find that article, and compare the date when the patent was
    submitted to the date of the release of the processor, you can check
    your theory.

    Some of them 1 year ago gave representations
    about advantages of removal of SMT.

    I did not read any accounts of that that appeared particularly
    knowledgeable. What are the advantages, or where can I read about
    these presentations?

    Removal of SMT and this super-core
    idea can be considered complimentary - both push into direction of
    cores with smaller # of EU pipes.

    What do you mean by that? Narrower cores? In recent years cores seem
    to have exploded in width. From 1995 up to and including 2018 Intel
    produced 3-wide and 4-wide designs (with 4-wide coming IIRC with Sandy
    Bridge in 2011), and since then even the Skymont E-core has grown to
    8-wide, with 26 execution ports and 16-wide retirement. And other CPU manufacturers have also increased the widths of their CPUs.

    It seems that there has been a breakthrough in extracting ILP, making
    wider cores pay off better, a breakthrough in designing wider register
    renamers and making other structures wider, or both.

    Pushing for narrower cores appears implausible to me at this stage.

    Concerning the removal of SMT, I can only guess, but that did not
    appear implausible to me with Intel's hybrid CPUs: They have P-cores
    for fast single-thread performance, and lots of E-cores for
    multi-thread performance. You allocate threads that need
    single-thread performance to P-cores and threads that don't to
    E-cores. If you have even more tasks, i.e., a heavily multi-threaded
    load, do you want to slow down the threads that run on the P-cores by
    switching them to SMT mode, also increasing the already-high power
    consumption of the P-cores, lowering the clock of everything to stay
    within the power limit, and thus possibly the performance? If not,
    you don't need SMT.

    Still, after touting the SMT horn for so long, I don't expect that
    such considerations are the only ones. There must be a significant
    advantage in design complexity or die area when leaving it away
    (contradicting the earlier claim that SMT costs very little).

    Concerning super cores, whatever it is, my guess is that the idea is
    to try to extract even more performance from (as far as software is
    concerned) single-threaded programs than achievable with the wide
    cores of today.

    Anyway, couple of months ago Tan himself said that Intel is reversing
    the decision to remove SMT.

    On the servers, they do not follow the hybrid strategy, for whatever
    reason, so the thoughts above don't apply there. And maybe they found
    that the cloud providers want SMT, in order to sell their customers
    twice as many "CPUs".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Sep 17 13:07:49 2025
    From Newsgroup: comp.arch

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Sep 17 18:53:24 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happened already a few decades ago; they translate
    x86 code into RISC-like microops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Sep 17 18:54:01 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Transmeta tried and failed to do this.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 17 23:00:15 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 18:53:24 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.


    Not really.

    First, translation on the fly does not count.

    Second, even for translation on the fly, only the ancient K6 worked that
    way. Their later chips do a lot of work at the level of macro-ops,
    which in the majority of cases have a one-to-one correspondence to the
    original x86 load-op and load-op-store instructions.

    Actually, I am not 100% sure about Bulldozer and its derivatives, but K7,
    K8 and all generations of Zen use macro-ops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.


    Badly outdated text.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 17 20:19:14 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
    programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That was tried three decades ago. https://en.wikipedia.org/wiki/Transmeta


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Sep 17 21:33:17 2025
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Sep 18 05:27:15 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic translation.

    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Sep 18 05:31:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this group should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps; at most it confirms the branch
    prediction or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in
    addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
    published 2004 or so (don't let the 2013 date from the reprint fool you)
    Discusses CPU design (not just OoO) using various real CPUs from the
    1990s as examples.

    Processor Microarchitecture -- An Implementation Perspective
    Antonio Gonzalez, Fernando Latorre, Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Ph.D. thesis, U. Toronto
    https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
    published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Sep 18 06:14:30 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

    It definitely was. However, even on modern high-performance OoO cores
    like Apple's M1-M4 P-cores or Qualcomm's Oryon, the performance of
    dynamically translated AMD64 code is usually lower than on comparable
    CPUs from Intel and AMD.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 18 03:39:57 2025
    From Newsgroup: comp.arch

    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).


    Now we have a different situation:
    Moore's law is dying off;
    Scalar CPU performance has hit a plateau;
    And, for many uses, performance is "good enough";
    A lot more software can make use of multi-threading;
    ...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better "performance per watt" metric.


    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.




    One possibility could be that virtual processors don't run on a single
    core, say:
    The logical cores exist more as VMs each running a virtual x86 processor
    core;
    The dynamic translation doesn't JIT translate to a linear program.

    Say:
    Breaks code into traces;
    Each trace uses something akin to CSP mixed with Pi-Calculus;
    Address translation is explicit in the ISA, with specialized ISA level memory-ordering and control-flow primitives.

    For example, there could be special ISA level mechanisms for submitting
    a job to a local job-queue, and pulling a job from the queue.
    Memory accesses could use a special "perform a memory access or branch-subroutine" instruction ("MEMorBSR"), where the MEMorBSR
    operations will try to access memory, either continuing to the next instruction (success) or Branching-to-Subroutine (access failed).

    Where the failure cases could include (but not limited to) TLB miss;
    access fault; memory ordering fault; ...
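
    A rough C model of that MEMorBSR idea (purely a sketch following the
    description above; the toy software TLB, the handler, and the sizes are
    all made up):

        /* The access either succeeds and "falls through", or "branches to a
         * subroutine" (here: a handler call) and is then retried. */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define PAGE_BITS 12
        #define TLB_SIZE  16

        static uint8_t phys_mem[1 << 16];               /* toy physical memory  */
        static struct { uint64_t vpn, pfn; int valid; } tlb[TLB_SIZE];

        static int tlb_lookup(uint64_t vaddr, uint64_t *paddr)
        {
            uint64_t vpn = vaddr >> PAGE_BITS;
            unsigned i   = vpn % TLB_SIZE;
            if (!tlb[i].valid || tlb[i].vpn != vpn)
                return 0;                               /* miss -> BSR          */
            *paddr = (tlb[i].pfn << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1));
            return 1;                                   /* hit  -> fall through */
        }

        static void handle_tlb_miss(uint64_t vaddr)     /* the "subroutine"     */
        {
            uint64_t vpn = vaddr >> PAGE_BITS;
            unsigned i   = vpn % TLB_SIZE;
            tlb[i].vpn   = vpn;                         /* trivial "page table" */
            tlb[i].pfn   = vpn & 0xF;                   /* wrap into toy memory */
            tlb[i].valid = 1;
        }

        /* The MEMorBSR load: try, branch to the handler on failure, retry. */
        static uint8_t mem_or_bsr_load8(uint64_t vaddr)
        {
            uint64_t paddr;
            while (!tlb_lookup(vaddr, &paddr))
                handle_tlb_miss(vaddr);
            return phys_mem[paddr];
        }

        int main(void)
        {
            memset(phys_mem, 0x5A, sizeof phys_mem);
            printf("loaded 0x%02X\n", (unsigned)mem_or_bsr_load8(0x12345));
            return 0;
        }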

    The "memory ordering fault" case could be, when traces are submitted to
    the queue, if they access memory, they are assigned sequence numbers
    based on Load and Store operations. When memory is accessed, the memory
    blocks in the cache could be marked with sequence numbers when read or modified. On access, it could detect if/when memory access have
    out-of-order sequence numbers, and then fall back to special-case
    handling to restore the intended order (reverting any "uncommitted"
    writes, and putting the offending blocks back into the queue to be
    re-run after the preceding blocks have finished).
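
    A similarly rough sketch of the sequence-number check (hypothetical; real
    hardware would keep such tags per cache line and trigger the rollback and
    re-queue itself):

        /* Each trace's memory ops carry a sequence number; a line remembers
         * the newest sequence that has touched it, and an access by an older
         * sequence than a recorded conflicting one signals an ordering fault. */
        #include <stdint.h>
        #include <stdio.h>

        struct line_tag {
            uint64_t last_read_seq;   /* newest sequence that read the line  */
            uint64_t last_write_seq;  /* newest sequence that wrote the line */
        };

        /* 1 = in order; 0 = ordering fault, the logically later trace that
         * already touched the line must be rolled back and re-queued. */
        static int access_check(struct line_tag *t, uint64_t seq, int is_write)
        {
            if (is_write) {
                if (t->last_read_seq > seq || t->last_write_seq > seq)
                    return 0;              /* a "future" trace saw stale data */
                t->last_write_seq = seq;
            } else {
                if (t->last_write_seq > seq)
                    return 0;              /* would read a "future" value     */
                if (t->last_read_seq < seq)
                    t->last_read_seq = seq;
            }
            return 1;
        }

        int main(void)
        {
            struct line_tag tag = { 0, 0 };
            /* trace #7 (logically later) reads the line first ...           */
            printf("read  by seq 7: %s\n", access_check(&tag, 7, 0) ? "ok" : "fault");
            /* ... then trace #3 (logically earlier) writes it: out of order. */
            printf("write by seq 3: %s\n", access_check(&tag, 3, 1) ? "ok" : "fault");
            return 0;
        }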

    Possibly, the caches wouldn't directly commit stores to memory, but
    instead could keep track of a group of cache lines as an "in-flight" transaction. In this case, it could be possible for a "logically older"
    block to see the memory as it was before a more recent transaction, but
    an out-of-order write could be detected via sequence numbers (if seen,
    it would mean a "future" block had run but had essentially read stale data).

    Once a block is fully committed (after all preceding blocks are
    finished) its contents can be written back out to main RAM.
    Could be held in an area of RAM local to the group of cores running the logical core.

    Possibly, such a core might actually operate in multiple address spaces:
    Virtual Memory, via the transaction oriented MEMorBSR mechanism;
    There would likely be an explicit TLB here.
    So, TLB Miss handling could be essentially a runtime call.
    Local Memory:
    Physical Address, small non-externally-visible SRAM;
    Divided into Core-Local and Group-Shared areas;
    Physical Memory:
    External DRAM or similar;
    Resembles more traditional RAM access (via Load/Store Ops);
    Could be used for VM tasks and page-table walks.


    Would likely require significant hardware level support for things like job-queues and synchronization mechanisms.

    One possibility could be that some devices could exist local to a group
    of cores, which then have a synchronous "first come, first serve" access pattern (possibly similar to how my existing core design manages MMIO).

    Possibly it could work by passing fixed-size messages over a bus, with
    each request/response pair to a device being synchronous.


    Possibly the JIT could try to infer possible memory aliasing between
    traces, and enforce sequential ordering if aliasing is likely. This is
    because performing the operations in the correct order the first time is
    likely to be cheaper than detecting an ordering violation and rolling
    back a transaction.

    Whereas proving that traces can't alias is likely to be a much harder
    problem than inferring a probable absence of aliasing. If no order
    violations occur during operation, it can be safely assumed that no
    memory aliasing happened.
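
    One very conservative way the JIT could approximate "probably no aliasing"
    (again just a sketch, not a claim about any real translator): compare the
    address ranges each trace is statically known to touch, and serialize
    when the ranges overlap or are unknown:

        #include <stdint.h>
        #include <stdio.h>

        struct trace_footprint {
            int      known;     /* 0 = accesses not statically resolvable      */
            uint64_t lo, hi;    /* half-open byte range [lo, hi) it may touch  */
        };

        /* Assume aliasing unless both footprints are known and disjoint. */
        static int may_alias(const struct trace_footprint *a,
                             const struct trace_footprint *b)
        {
            if (!a->known || !b->known)
                return 1;
            return a->lo < b->hi && b->lo < a->hi;
        }

        int main(void)
        {
            struct trace_footprint t1 = { 1, 0x1000, 0x1100 };
            struct trace_footprint t2 = { 1, 0x2000, 0x2040 };
            struct trace_footprint t3 = { 1, 0x10f0, 0x1200 };

            /* Disjoint: run in parallel; overlapping: force program order. */
            printf("t1 vs t2: %s\n", may_alias(&t1, &t2) ? "serialize" : "parallel");
            printf("t1 vs t3: %s\n", may_alias(&t1, &t3) ? "serialize" : "parallel");
            return 0;
        }

    Guessing "parallel" wrongly would still be caught by the sequence-number
    check; guessing "serialize" wrongly only costs performance.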

    Maintaining transactions would complicate the cache design though, since
    now there is a problem that the cache line can't be written back or
    evicted until its write-associated sequence is fully committed.

    Might also need to be separate queue spots for "tasks currently being
    worked on" vs "to be done after the current jobs are done". Say, for
    example, if a job needs to be rolled-back and re-run, it would still
    need to come before jobs that are further in the future relative to itself.

    Unlike memory, register ordering is easier to infer statically, at least
    in the absence of dynamic branching.

    Might need to enforce ordering in cases where:
    Dynamic branch occurs and the path can't be followed statically;
    A following trace would depend on a register modified in a preceding trace;
    ...



    As for how viable any of this is, I don't know...

    The VM could be a lot simpler if one assumes a single threaded VM.


    Also unclear is if an ISA could be designed in a way to keep overheads
    low enough (would be a waste if the multi-threaded VM is slower than a
    single threaded VM would have been). But, this would require a lot of
    exotic mechanisms, so dunno...

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 18 03:58:16 2025
    From Newsgroup: comp.arch

    On 9/18/2025 1:14 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

    It definitely was. However, even a modern high-performance OoO cores
    like Apple M1-M4's P-cores or on Qualcomm's Oryon, the performance of dynamically-translated AMD64 code is usually slower than on comparable
    CPUs from Intel and AMD.


    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    Then there is also Perf/$, and if such a CPU can win in both Perf/W and Perf/$, then it can still win even if it is slower, by throwing more
    cores at the problem.


    Though, the possibly interesting idea could be trying for a
    multi-threaded translation rather than a single-threaded translation.
    But, to have any hope, a multi-threaded translation is likely to need
    exotic ISA features; whereas a single-threaded VM could probably run
    mostly OK on normal ARM or RISC-V or similar (well, assuming a world
    where RISC-V addresses some more of its weak areas; but then again, with
    recent proposals for indexed load/store and auto-increment popping up,
    this is starting to look more likely...).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 18 17:51:36 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 05:27:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    BGB <cr88192@gmail.com> writes:
    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an x86
    chip by running *everything* in a firmware level emulator via
    dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic
    translation.


    That's imprecise.
    The first couple of generations of Itanium 2 (McKinley, Madison) still had
    IA-32 hardware; it was gone in Montecito (2006).
    Dynamic translation of application code was available much earlier,
    indeed, but early removal of the [crappy] hardware solution was probably
    considered too risky.



    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton

    As said by just about everybody, BGB's proposal is most similar
    to Transmeta. What was not said by everybody is that a similar approach
    was tried for Arm, by NVidia no less.
    https://en.wikipedia.org/wiki/Project_Denver



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 18 16:16:54 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

    a) Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.
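
    For concreteness, a single C statement like the one below is the kind of
    thing at issue; per the above, K7/K8 can keep it as one LD-OP-ST macro-op
    (one TLB lookup, two data-cache accesses), while a strict load/store RISC
    splits it (the RISC sequence in the comment is schematic, not any
    particular ISA):

        /* Points (a)-(c) in one line of C. */
        void ld_op_st(long *base, long index, long value)
        {
            /* address = base + index*8 + 16, i.e. [Rbase + Rindex<<3 + Disp] */
            base[index + 2] += value;

            /* x86-64:  add qword ptr [rbx + rcx*8 + 16], rax
             *          -> one load-op-store macro-op on K7/K8.
             * Schematic load/store RISC:
             *          ld  t0, [base + (index << 3) + 16]  (or shift+add, then ld)
             *          add t0, t0, value
             *          st  t0, [base + (index << 3) + 16]  (TLB consulted again)
             */
        }

        int main(void)
        {
            long a[4] = { 0, 0, 0, 0 };
            ld_op_st(a, 1, 5);      /* a[3] += 5 */
            return (int)a[3];
        }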

    And do not get me started on the trap/exception/interrupt model.


    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 18 12:33:44 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.
    The term is widely used to mean something that executes internally.
    Beyond that it depends on the specific of each micro-architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB-style design where all the knowledge about
    each instruction (including register ids, immediate data,
    scheduling info, result data, and status) is stored in a single ROB entry,
    then 100 bits sounds pretty small, so I'm guessing that was a 32-bit CPU.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    AMD explains their terminology here, but note that the relationship
    between Macro-Ops and Micro-Ops is micro-architecture specific.

    A Seventh-Generation x86 Microprocessor, 1999
    https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.
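
    Or, paraphrasing that terminology as data structures (illustrative only;
    not AMD's actual encoding or field set):

        #include <stdio.h>

        struct micro_op   { unsigned opcode, src1, src2, dst; }; /* minimum executable entity */
        struct macro_op   { int n_uops; struct micro_op uop[3]; }; /* bundle of 1-3 micro-ops */
        struct micro_line { struct macro_op mop[3]; };            /* 3 macro-ops from the ROM */

        int main(void)
        {
            /* A "simple instruction": one macro-op carrying two micro-ops
             * (e.g. a load feeding an ALU op), as produced by the decoder. */
            struct macro_op m = { 2, { { 1, 0, 0, 7 }, { 2, 7, 3, 7 } } };
            printf("macro-op with %d micro-ops\n", m.n_uops);
            return 0;
        }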

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
    published 2004 or so (don't let the 2013 date from the reprint fool you) Discusses CPU design (not just OoO) using various real CPUs from the
    1990s as example.

    Processor Microarchitecture -- An Implementation Perspective
    Antonio Gonzalez , Fernando Latorre , Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Ph.D. thesis, U. Toronto https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton

    Other micro-architecture related sources since 2000:

    Book
    A Primer on Memory Consistency and Cache Coherence 2nd Ed, 2020
    Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood

    Dissertation
    Complexity and Correctness of a Super-Pipelined Processor, 2005
    Jochen Preiß

    Book
    General-Purpose Graphics Processor Architectures, 2018
    Aamodt, Wai Lun Fung, Rogers

    Book
    Microprocessor Architecture
    From Simple Pipelines to Chip Multiprocessors, 2010
    Jean-Loup Baer

    Book
    Processor Microarchitecture An Implementation Perspective, 2011
    Antonio González, Fernando Latorre, and Grigorios Magklis

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 18 20:26:29 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using the term Rops almost 25 years ago.
    If they used it in early K7 manuals, then it was due to inertia (K6
    manuals copied and pasted without much thought given) and partly because
    of marketing, because RISC was considered cool.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 18 14:42:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using term Rops almost 25 years ago.
    If they used it in early K7 manuals then it was due to inertia (K6
    manuals copy&pasted without much of thought given) and partly because
    of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC ones.
    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.

    They still use the term "micro op" in the Intel and AMD Optimization guides.
    It still means a micro-architecture-specific, internal, simple, discrete
    unit of execution, albeit a more complex one as transistor budgets allow.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 18 14:05:04 2025
    From Newsgroup: comp.arch

    On 9/18/2025 11:16 AM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

    a)Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.

    And do not get me started on the trap/exception/interrupt model.



    Still reminds me of the LOL of some of the old marketing for the TI
    MSP430 trying to pass it off as RISC:
    In practice has variable-length instructions (via @PC+ addressing);
    Has auto-increment addressing modes and similar;
    Most instructions can operate directly on memory;
    Has ability to do Mem/Mem operations;
    ...

    In effect, the MSP430 is closer to the DEC PDP-11 than it is to much of anything in the RISC family.

    Even SuperH, which also branched off from similar origins, had gone over
    to purely 16-bit instructions, and was Load/Store, so more deserving of
    the RISC title (though apparently still a lot more PDP-11 flavored than
    MIPS flavored).


    Their rationale: "But our instruction listing isn't very long, so RISC",
    never mind all of the edge cases they hid in the various addressing
    modes and register combinations.

    But, yeah, following similar logic to what TI was using, one could look
    at something like the Motorola 68000 and be all like, "Yep, looks like
    RISC to me"...


    ...




    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 18 22:56:22 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 14:42:36 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator
    via dynamic translation.
    For AMD, that has happend already a few decades ago; they
    translate x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I
    wrote in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using the term Rops almost 25 years ago.
    If they used it in early K7 manuals, then it was due to inertia (the K6
    manuals copy&pasted without much thought given) and partly
    because of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC
    ones.

    In 1988. In 1998 - much less so.

    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.


    Of course, they did. Several times.
    Even Zen3 works non-trivially differently from Zen1 and 2.
    If you stopped following in the previous millennium, that's your problem
    rather than theirs.

    They still use the term "micro op" in the Intel and AMD Optimization
    guides. It still means an micro-architecture specific internal
    simple, discrete unit of execution, albeit a more complex one as
    transistor budgets allow.


    By that logic, every CISC is RISC, because at some internal level it
    executes simple operations. Even those with a load-ALU pipeline do the
    load and the ALU op at separate stages.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 19 09:50:32 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.

    A lot more software can make use of multi-threading;

    Possible, but how would it change things?

    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better "performance per watt" metric.

    Evidence?

    Yes, you can run CPUs with Intel P-cores and AMD's non-compact cores
    with higher power limits than what the Apple and Qualcomm chips
    approximately consume (I have not seen proper power consumption
    numbers for these since Anandtech stopped publishing), but you can
    also run Intel CPUs and AMD CPUs at low power limits, with much better "performance per watt". It's just that many buyers of these CPUs care
    about performance, not performance per watt.

    And if you run AMD64 software on your binary translator on CPUs with
    e.g., ARM A64 architecture, the performance per watt is worse than
    when running it on an AMD64 CPU.

    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.

    The approach of a large number of small, slow cores has been tried,
    e.g., in the TILE64, but has not been successful with that core size.
    Other examples are Sun's UltraSparc T1000 and follow-ons, which were
    somewhat more successful, but eventually led to the cancellation of
    SPARC.

    Finally, Intel now offers E-core-only chips for clients (e.g., N100)
    and servers (Sierra Forest), but they have not stopped releasing
    P-Core-only server CPUs. For the desktop, the CPU with the largest
    number of E-Cores (16) also has 8 P-cores, so Intel obviously
    believes that not all desktop applications are embarrassingly
    parallel.

    Intel used to have Xeon Phi CPUs with a higher number of narrower
    cores, but eventually replaced them with Xeon processors that have
    fewer, but more powerful cores.

    AMD offers compact-core-only server CPUs with more cores and less
    cache per core, but otherwise the same microarchitecture, only with a
    much lower clock ceiling. (There is a difference in microarchitecture
    wrt executing AVX-512 instructions on Zen5, but that's minor). AMD
    also offers server CPUs with non-compact cores; interestingly, if we
    compare CPUs with the same numbers of cores, the launch price (at the
    same date) is not that far apart:

                           GHz
    Model      cores  base  boost  cache   TDP    launch    current
    EPYC 9755    128   2.7    4.1  512MB  500W  USD12984    EUR5979
    EPYC 9745    128   2.3    3.7  256MB  400W  USD12141    EUR4192

    Current pricing from <https://geizhals.eu/?cat=cpuamdam4&xf=12099_Server~25_128~596_Turin~596_Turin+Dense>;
    however, the third-cheapest dealer for the 9745 asks for EUR 6129, and
    the cheapest price up to 2025-09-10 has been EUR 6149, so the current
    price difference may be short-lived. The cheapest price for the 9755
    was 4461 on 2025-08-25, and at that time the 9755 was cheaper than the
    9745 (at least as far as the prices seen by the website above are
    concerned).

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more
    expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can
    compensate for that with larger caches.

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU
    require less area? The cache sizes per core are not reduced, and
    their area is not reduced much. The core itself will get smaller, and
    its performance will also get smaller (although by less than the
    core). But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.
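
    As a toy illustration of that argument, with made-up relative
    numbers (nothing below is measured; it only shows the shape of the
    trade-off):

    #include <stdio.h>

    int main(void) {
        /* Assumed relative units: a big core of area 4.0 at performance 1.0,
           and a small core with 1/4 the core area at 1/2 the performance,
           both keeping the same per-core cache (2.0) and roughly the same
           per-core slice of the interconnect (1.0). */
        double big_total   = 4.0 + 2.0 + 1.0, big_perf   = 1.0;
        double small_total = 1.0 + 2.0 + 1.0, small_perf = 0.5;

        printf("area per unit performance: big %.1f, small %.1f\n",
               big_total / big_perf, small_total / small_perf);
        /* Prints 7.0 vs 8.0: the core shrank 4x for only 2x less performance,
           yet area per unit of performance went up, because the cache and the
           interconnect did not shrink with the core. */
        return 0;
    }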

    There is one counterargument to these considerations: The largest
    configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
    E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 19 14:33:44 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips don't have a 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.

    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 19 18:12:38 2025
    From Newsgroup: comp.arch

    On Fri, 19 Sep 2025 09:50:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).


    That particular problem is addressed by grouping smaller cores into
    clusters with a shared L2 cache. It's especially effective for scaling
    when the L2 cache is truly inclusive relative to the underlying L1 caches.
    The price is limited L2 bandwidth as seen by the cores.

    BTW, I didn't find any info about the replacement policy of Intel's
    Sierra Forest L2 caches.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 19 15:05:56 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to

    I am objecting to the claim that uops are RISC-like, and that there is
    a translation to RISC occurring inside the CPU, and (not present here,
    but often also claimed) that therefore there is no longer a difference
    between RISC and non-RISC.

    One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
    RISC architecture is an architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.

    Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
    more importantly, valued reservation stations, and yes, the 118 or
    whatever bits include the operands. I have no idea how the P6 handles
    its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
    (but I think it has a unified scheduler, so that would not work out,
    or maybe I miss something).

    But concerning the discussion at hand: Containing the data is a
    significant deviation from RISC instruction sets, and RISC
    instructions are typically only 32 bits or 16 bits wide.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch
    prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    The major point is that the OoO engine (the part that deals with uops)
    sees a linear sequence of uops it has to process, with nearly all
    actual branch processing (which an architecture has to do) done in a
    part that does not deal with uops. With the advent of uop caches that
    has changed a bit, but many of the CPUs for which the uop=RISC claim
    has been made do not have an uop cache.

    It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    AMD explains there terminology here but note that the relationship
    between Macro-Ops and Micro-Ops is micro-architecture specific.

    A Seventh-Generation x86 Microprocessor, 1999
    https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the
    macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy

    Their "Computer Architecture" book is also revised every few years,
    but their treatment of OoO makes me think that they are not at all
    interested in that part anymore, instead more in, e.g., multiprocessor
    memory subsystems.

    And the fact that we see so few recent books on the topics makes me
    think that many in academia have decided that this is a topic that
    they leave to industry.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 19 16:14:53 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    -------------------------------

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    In the reservation station.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 19 16:23:06 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    --------------------------------------

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    Yes, exactly:: if you have a large number of cores doing a performance of
    X, they will need exactly the same memory BW as a smaller number of cores
    also performing at X.

    In addition, the interconnect has to be at least as good as the small core system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can compensate for that with larger caches.

    Sooner or later, you actually have to read/write main memory.

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU require less area? The cache sizes per core are not reduced, and
    their area is not reduced much.

    A core running at ½ the performance can use a cache that is ¼ the size
    and see the same percentage degradation WRT cache misses (as long as
    main memory is equally latent). TLBs too.
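
    A back-of-the-envelope check of that claim, assuming the common rule
    of thumb that miss rate scales roughly as 1/sqrt(cache size) and the
    same miss penalty for both cores (every number below is invented):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double penalty_ns    = 100.0;  /* assumed DRAM miss penalty (same for both) */
        double big_ns_insn   = 0.25;   /* assumed big core: 0.25 ns per instruction */
        double small_ns_insn = 0.50;   /* half the performance                      */
        double big_miss      = 0.001;  /* assumed misses per instruction            */
        double small_miss    = big_miss * sqrt(4.0);  /* 1/4 cache => 2x miss rate  */

        printf("extra time from misses: big %.0f%%, small %.0f%%\n",
               100.0 * big_miss   * penalty_ns / big_ns_insn,
               100.0 * small_miss * penalty_ns / small_ns_insn);
        return 0;   /* both print 40%: the same percentage degradation              */
    }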

    The core itself will get smaller, and

    12× smaller and 12× lower power

    its performance will also get smaller (although by less than the
    core).

    for ½ the performance

    But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.

    GBOoO Cores tend to be about the size of 512KB of L2

    There is one counterargument to these considerations: The largest configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
    E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 19 11:41:19 2025
    From Newsgroup: comp.arch

    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running has a 105W TDP; I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.


    Probably need to set up a RasPi with a 64-bit OS at some point and see
    how this performs... (wouldn't really be as accurate to compare x86-64
    with 32-bit ARM).


    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that.


    For now, just on the mailing lists, eg: https://lists.riscv.org/g/tech-arch-review/message/368


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 19 12:00:07 2025
    From Newsgroup: comp.arch

    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competetive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that
    direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to
    some likely highly specialized ISA.



    A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
    At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 19 12:38:51 2025
    From Newsgroup: comp.arch

    On 9/19/2025 12:00 PM, BGB wrote:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB  <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until
    Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now.  Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores.  If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything?  If you cannot make
    single-threaded IA-32 or AMD64 programs run at competetive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
       Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

       Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

       And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to some likely highly specialized ISA.



       A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
      At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )


    Seems I have a little time still...

    Did find this: https://browser.geekbench.com/v4/cpu/compare/2498562?baseline=2792960

    Not an exact match: I think the Eee was running the Atom at a somewhat
    lower clock speed, and this is a Pi3 rather than the original Pi.
    The Pi3 has 4x A53 cores.


    But, yeah, they are roughly matched on single thread performance when
    the Atom has a clock-speed advantage.

    Though, this seems to imply that they are more just "comparable" on the performance front, rather than Atom being significantly slower...


    Would need to try to dig-out the Eee and re-test, assuming it still
    works/etc.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Sep 19 17:48:52 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").
    I don't know what you are objecting to

    I am objecting to the claim that uops are RISC-like, and that there is
    a translation to RISC occuring inside the CPU, and (not present here,
    but often also claimed) that therefore there is no longer a difference between RISC and non-RISC.

    Ok. I disagree with this because I have a different view of the
    changes in moving from CISC to RISC (which I'll describe below).

    One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
    RISC architecture is an architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.

    Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
    more importantly valued reservation stations, and yes, the 118 or
    whatever bits include the operands. I have no idea how the P6 handles
    its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
    (but I think it has a unified scheduler, so that would not work out,
    or maybe I miss something).

    But concerning the discussion at hand: Containing the data is a
    significant deviation from RISC instruction sets, and RISC
    instructions are typically only 32 bits or 16 bits wide.

    Yes, and those 32-bit external ISA instructions are mapped into uOps internally. All that is different here is the difficulty for decode.

    I see the difference between CISC and RISC as in the micro-architecture, changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    An important consequence of the sequential design is that
    most of this machine is sitting idle most of the time.

    One can take an Alpha ISA and implement it with a microcoded sequencer
    but that should not be called RISC so the distinction must lie elsewhere.

    RISC changes that design to one like a multi-threaded program with
    messages passing between them called uOps, where the dynamic state
    of each instruction is mostly carried with the uOp message,
    and each thread does something very simple and passes the uOp on.
    Where global resources are required, they are temporarily dynamically
    allocated to the uOp by the various threads, carried with the uOp,
    and returned later when the uOp message is passed to the Retire thread.
    The Retire thread is the only one which updates the visible global state.

    As I see it, this Multiple Simple Thread Message Passing Architecture
    (MST-MPA) is the essence of the change RISC invoked, and any
    micro-architecture that follows it is in the risc design style.
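
    A minimal C sketch of that picture (stage names and uOp fields are
    invented; only the shape of the message passing matters): each stage
    does one simple thing to a uOp message and hands it on, and only the
    retire stage touches the architecturally visible state.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {              /* the "uOp message"                          */
        uint32_t seq;             /* program-order tag, used at retire          */
        uint8_t  dst;             /* architectural destination register         */
        uint64_t a, b;            /* dynamic state carried with the message     */
        uint64_t result;
    } uop_msg;

    static uint64_t arch_regs[32];                /* the visible global state   */

    static void execute_stage(uop_msg *u) { u->result = u->a + u->b; }
    static void retire_stage (uop_msg *u) { arch_regs[u->dst] = u->result; }
                                  /* only retire updates architectural state    */
    int main(void) {
        uop_msg u = { .seq = 1, .dst = 3, .a = 40, .b = 2 };
        execute_stage(&u);        /* message handed from one simple stage...    */
        retire_stage(&u);         /* ...to the next                             */
        printf("r3 = %llu\n", (unsigned long long)arch_regs[3]);
        return 0;
    }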

    The RISC design guidelines described by various papers, rather than
    go/no-go decisions, are mostly engineering compromises for consideration
    of things which would make an MST-MPA more expensive to implement or
    otherwise interfere with maximizing the active concurrency of all threads. Whether the register file has 8, 16, or 32 entries affects the frequency
    of stalls but doesn't change whether it is implemented as MST-MPA and
    therefore entitled to be called "RISC".

    This is why I think it would have been possible to build a risc-style
    PDP-11 in 1975 TTL, or a VAX if they had just left the instructions of
    the same complexity as PDP-11 ISA (53 opcodes, max one immediate,
    max one mem op per instruction).

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch
    prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.
    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    The major point is that the OoO engine (the part that deals with uops)
    sees a linear sequence of uops it has to process, with nearly all
    actual branch processing (which an architecture has to do) done in a
    part that does not deal with uops. With the advent of uop caches that
    has changed a bit, but many of the CPUs for which the uop=RISC claim
    has been made do not have an uop cache.

    There are multiple places that can generate next RIP addresses:
    - The incremented RIP for the current instruction
    - Branch Prediction can redirect Fetch
    - Decode can pick off unconditional branches and immediately redirect Fetch.
    - Decode could also notice if the branch predictor made an erroneous
    decision and redirect Fetch.
    - Register Read might forward a "JMP reg" address directly to Fetch.
    - The Branch Unit BRU has a uOp scheduler to wait for in-flight registers
    or condition codes and then processes all branch & jump uOps and
    possibly redirects Fetch, and updates Branch Prediction.
    - uOp Retire detects exceptions and can force a Fetch redirect.
    - Interrupts can redirect Fetch.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 07:56:49 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 19 Sep 2025 09:50:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more
    expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).


    That particualr problem is addressed by grouping smaller cores into
    clusters with shared L2 cache. It's especially effective for scaling
    when L2 cache is true inclusive relatively to underlying L1 caches.
    The price is limited L2 bandwidth as seen by the cores.

    The other price is longer L2 latency; on a Core Ultra 9 285K:

                   L2       L3      DRAM
    Skymont      4.24ns  14.92ns   ~180ns
    Lion Cove    2.98ns  14.75ns  99.52ns

    Numbers from <https://chipsandcheese.com/p/analyzing-lion-coves-memory-subsystem> <https://chipsandcheese.com/p/skymont-in-desktop-form-atom-unleashed>

    Estimated from the graph where I could not find numbers.

    I wonder what slows down the DRAM access of Skymont on the same chip
    so much when the L3 latency is so close.

    Yes, organizing the interconnect in a hierarchical way can help reduce
    the increase in interconnect cost, but I expect that there is a reason
    why Intel did not do that for its server CPUs with P-Cores, by e.g.,
    forming clusters of 4, and then continuing with the ring; instead,
    they opted for a grid interconnect.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 08:33:37 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    -------------------------------

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the
    macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    In the reservation station.

    Ok, so what I currently imagine is this: The macro-op contains tags or
    (for non-valued reservation stations) register numbers for the
    intermediate results. It is sent to the affected reservation
    stations, which pick the parts relevant to them out of the macro-op,
    thus forming micro-ops. If one of the reservation stations is full,
    I expect that the macro-op is kept back in the front end. The ROB
    does not need to wait for each micro-op, but only for the last one in
    the macro-op (if the micro-ops have one last one, which they have in
    the case of load-op instructions (the op is last) and RMW instructions
    (the W is last)).
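
    Expressed as data structures, the guess above might look roughly
    like this (all names and field layouts invented): the front end
    emits one macro-op, and each affected reservation station picks out
    only the fields it needs, which is what then constitutes a micro-op.

    #include <stdint.h>

    typedef struct {                          /* one macro-op (invented layout)  */
        uint8_t  mem_base, mem_index, mem_scale;  int32_t disp;   /* LSU part    */
        uint8_t  alu_fn, alu_src;                                 /* ALU part    */
        uint8_t  load_tag, result_tag;  /* tags naming the intermediate results  */
    } macro_op;

    typedef struct { uint8_t base, index, scale; int32_t disp; uint8_t dst_tag; } lsu_uop;
    typedef struct { uint8_t fn, src_reg, src_tag, dst_tag; } alu_uop;

    /* Each reservation station picks the parts relevant to it. */
    static lsu_uop pick_lsu(macro_op m) {
        lsu_uop u = { m.mem_base, m.mem_index, m.mem_scale, m.disp, m.load_tag };
        return u;
    }
    static alu_uop pick_alu(macro_op m) {
        alu_uop u = { m.alu_fn, m.alu_src, m.load_tag, m.result_tag };
        return u;
    }

    int main(void) {
        macro_op m = { 1, 2, 2, 16, /*ADD*/ 0, 3, /*tags*/ 10, 11 };
        lsu_uop l = pick_lsu(m);
        alu_uop a = pick_alu(m);
        return l.dst_tag == a.src_tag ? 0 : 1;  /* the load's result feeds the ALU uop */
    }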

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 08:47:10 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    --------------------------------------

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    Yes, exactly:: if you have a large number of cores doing a performance of
    X, they will need exactly the same memory BW as a smaller number of cores also performing at X.

    The memory subsystem plays a big role, however. The caches filter
    away many of the main-memory accesses.

    Sooner or later, you actually have to read/write main memory.

    In general, no. If the caches are large enough, the code and data can
    be loaded from the disk or the network into the cache, processed
    there, and then sent out to the disk or network without ever accessing
    DRAM.

    And that's not just a theoretical thing: There are network packet
    routers with Xeon-D CPUs where the network interfaces deliver the
    packets into L3 cache, the program looks at each packet, decides where
    it is sent, and performs the appropriate action, all within the
    caches; the end result is then consumed by the network interfaces
    again. There are about 70ns per packet, so there is no time for a DRAM
    access and its latency (I expect that there will be some loading from
    DRAM when an unusual route is needed that is not cached, but for the
    majority of packets, there is no time for that).
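
    For illustration, a ~70ns budget is about what minimum-size frames
    on a 10 Gb/s port work out to (the link speed is an assumption
    here, not something stated above):

    #include <stdio.h>

    int main(void) {
        double line_rate_bps      = 10e9;            /* assumed 10 Gb/s port     */
        double bits_per_min_frame = (64 + 20) * 8.0; /* min frame + preamble/IFG */
        /* Prints 67.2 ns, in line with the ~70ns figure above and well below
           a DRAM access of the order of 100ns or more. */
        printf("%.1f ns per packet\n", 1e9 * bits_per_min_frame / line_rate_bps);
        return 0;
    }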

    On the more theoretical side, that's the main fallacy in "Hitting the
    Memory Wall" (1995). I have written a critique of that paper in 2001
    and announced it here <9fst8d$60u$1@news.tuwien.ac.at>; you can find
    it on <http://www.complang.tuwien.ac.at/anton/memory-wall.html>.
    Interestingly, I have now found a retrospective paper about this from
    2004 by McKee: <http://svmoore.pbworks.com/w/file/fetch/59055930/p162-mckee.pdf>; she
    mentions comp.arch several times, but apparently missed my posting or
    did not find it relevant enough to address it in her retrospective
    (given the large number of other reactions that the paper received,
    the latter would not be surprising).

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU
    require less area? The cache sizes per core are not reduced, and
    their area is not reduced much.

    A core running at ½ the performance can use a cache that is ¼ the size
    and see the same percentage degradation WRT cache misses (as long as
    main memory is equally latent). TLBs too.

    Yes (as long as the latency does not rise), but if they do, the number
    of memory accesses filtered out by the caches decreases, and the DRAM bandwidth required by the core increases beyond the 1/2 value. So now
    you have twice the number of cores, each with more than 1/2 memory
    bandwidth requirement. So you need to increase the memory bandwidth,
    or you will lose performance; the mechanism for the lower performance
    is that the latency of the memory accesses increases from having to
    wait for other memory accesses to be served.

    The alternative I outlined is to use same-sized caches (per core), so
    the caches filter just as well as with the big cores, and the 2n
    smaller cores need the same memory bandwidth as the n larger cores.

    12× smaller and 12× lower power
    for 1/2 the performance

    In a Samsung Exynos9820, the Cortex-A75 has 3-4 times the size of a
    Cortex-A55; power and performance depend on where on the
    voltage-frequency curve we use these cores; they have similar
    performance/watt ranges, and at the same performance/watt, the
    performance of the A75 is 3-4 times higher than that of the
    A55. <2024Jan24.084731@mips.complang.tuwien.ac.at> <2024Jan24.225412@mips.complang.tuwien.ac.at>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 10:25:40 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    With 1700MHz.

    Data from <https://www.complang.tuwien.ac.at/franz/latex-bench>;
    numbers are times in seconds:

    Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
    Core i5-1135G7, 4134MHz, 8MB L3, Ubuntu 21.04 (64-bit) 0.279

    The Core i5-1135G7 is limited to 12W on the machine where I measured
    this; the Core i5-1135G7 is 8.9 times faster than the 1896MHz Cortex
    A53, so the 1700MHz A53 is probably about 9.9 times slower than the
    Core i5-1135G7. That's for a single core. I cannot measure how far
    the MT6752 clocks down under multi-core/multi-thread load. I did it
    for the Core i5-1135G7:

    wget http://www.complang.tuwien.ac.at/anton/latex-bench/bench.tex
    wget http://www.complang.tuwien.ac.at/anton/latex-bench/types.bib
    for i in 0 1 2 3 4 5 6 7; do mkdir $i; cp bench.tex types.bib $i; done
    for i in 0 1 2 3 4 5 6 7; do (cd $i; taskset -c $i sh -c "latex bench >/dev/null; bibtex bench >/dev/null; while true; do /bin/time -f\"%U\" latex bench >/dev/null; done" &); done

    When using all 8 threads, the CPU clocked itself down to 2100MHz (the
    base frequency for TDP=12W is 900MHz), and each LaTeX benchmark ran in 0.93s-0.99s user time (8 in parallel). I.e., 8*0.12s per invocation.

    I also measured it with only 4 processes, one for each core. The
    clock was 2400-2500MHz, the times 0.51s-0.53s, i.e., 4*0.13s per
    invocation. The throughput advantage of SMT is very small here.

    Anyway, the Core i5-1135G7 gets one run every 0.12s from 12W, while
    the MT6752 gets one run every 0.35s (2.488*1896/1700/8, almost three
    times slower) from 7W, even if we can assume it can do 1700MHz on all
    cores while staying in the 7W. In any case, the bottom line is that
    the Core i5-1135G7 at 12W is more power-efficient than the MT6752, and
    that's with the A53 running the benchmark native, not in emulation.
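
    In joules per run, using the numbers above (and, as above, assuming
    the MT6752 really can sustain all eight cores at 1700MHz within its
    7W):

    #include <stdio.h>

    int main(void) {
        double i5_s_per_run = 0.12, i5_watts = 12.0; /* Core i5-1135G7, 8 threads  */
        double mt_s_per_run = 0.35, mt_watts =  7.0; /* MT6752 estimate from above */
        printf("i5-1135G7: %.2f J/run\n", i5_s_per_run * i5_watts); /* 1.44 J      */
        printf("MT6752:    %.2f J/run\n", mt_s_per_run * mt_watts); /* 2.45 J      */
        return 0;
    }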

    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that.


    For now, just on the mailing lists, eg: https://lists.riscv.org/g/tech-arch-review/message/368

    Interesting, thanks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 11:48:00 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that direction.

    What direction?

    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...

    That's probably a software problem. Different Eee PC models have
    different CPUs, Celeron M @571MHz, 900MHz, or 630MHz, Atoms with
    1330-1860MHz, or AMD C-50 or E350. All of them are quite a bit faster
    than the 700MHz ARM11. While I don't have a Raspi1 result on https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
    result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
    faster than the 700MHz ARM11), and also some CPUs similar to those
    used in the Eee PC; numbers are times in seconds:

    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    - Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89
    - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
    - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216

    So all of these CPUs clearly beat the one in the Raspi3, which I
    expect to be clearly faster than the ARM11.

    Now imagine running the software that made the Eee PC so slow with
    dynamic translation on a Raspi1. How slow would that be?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 12:01:39 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,

    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.

    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    One can take an Alpha ISA and implement it with a microcoded sequencer
    but that should not be called RISC

    Alpha is a RISC architecture. So this hypothetical implementation
    would certainly be an implementation of a RISC architecture.

    RISC changes that design to one like a multi-threaded program with
    messages passing between them called uOps, where the dynamic state
    of each instruction is mostly carried with the uOp message,
    and each thread does something very simple and passes the uOp on.
    Where global resources are required, they are temporarily dynamically allocated to the uOp by the various threads, carried with the uOp,
    and returned later when the uOp message is passed to the Retire thread.
    The Retire thread is the only one which updates the visible global state.

    This does not sound like RISC vs. non-RISC at all, but like OoO microarchitecture, and the contrast would be an in-order execution microarchitecture. Both RISCs and non-RISCs can make use of OoO microarchitectures, and have done so.

    The RISC design guidelines described by various papers, rather than
    go/no-go decisions, are mostly engineering compromises for consideration
    of things which would make an MST-MPA more expensive to implement or otherwise interfere with maximizing the active concurrency of all threads.

    The interesting aspect is that RISCs are easier to implement in simple pipelines like the ones of early ARM, HPPA, MIPS and SPARC
    implementations, but can also be implemented as in-order superscalar
    or OoO superscalar microarchitectures; you can also implement one as a sequentially-executed microcode engine. Wolfgang Kleinert implemented
    a microcoded RISC in the 1980s, but I think that it was pipelined.

    The advantages from the instruction set diminish with the more complex implementation techniques, and there are a number of instruction set
    design decisions in early RISCs that turned out to be not so great and
    that were eliminated in later RISCs (if not from the start), most
    notably delayed branches, but many of the recent instruction sets (ARM
    A64, RISC-V) take many of the same design decisions as the RISC
    architectures of the 1980s (load/store, register architecture, etc.,
    see John Mashey's criteria and recent discussions about this topic),
    whereas many non-RISCs deviate from this design style.

    This is why I think it would have been possible to build a risc-style
    PDP-11 in 1975 TTL, or a VAX if they had just left the instructions of
    the same complexity as PDP-11 ISA (53 opcodes, max one immediate,
    max one mem op per instruction).

    The PDP-11 instruction set is not RISC, and you paint a picture that
    is too rosy: It has up to two mem ops per instruction, and IIRC even memory-indirect addressing modes. Not a problem for the
    physically-addressed first implementations, nasty as soon as you add
    virtual memory.

    Implementing a pipelined PDP-11 (like the 486 was
    for IA-32) would have been quite a bit harder than for the
    486 (admittedly the 486 has to deal with 16-bit modes and other legacy features, so it's not the easiest target, either).

    For the VAX I would go for a RISC instead of a cleaned-up IA-32-like instruction set, and then implement pipelining. I would rather put
    the effort in implementing compressed instructions rather than
    load-and-op or RMW instructions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Sep 20 13:10:49 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> wrote:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.

    A single core in the Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz)
    benchmarks to 4453.45 DMIPS (Dhrystone MIPS). A single core in my desktop
    benchmarks to about 50000 DMIPS. Dhrystone contains string operations
    which benefit from SSE/AVX, but I would expect that on media loads the
    speed ratio would be even more favourable to the desktop core. On jumpy
    code the ratio is probably lower. The 1GHz RISCV in the Milkv-Duo
    benchmarks to 1472 DMIPS.
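    A rough sketch of the arithmetic behind such DMIPS figures, assuming the
    conventional VAX 11/780 reference of 1757 Dhrystones/s (the inputs below
    are illustrative, not measured values):

    /* DMIPS = Dhrystones/second normalized to the VAX 11/780's
       1757 Dhrystones/s; DMIPS/MHz then normalizes for clock speed. */
    #include <stdio.h>

    int main(void)
    {
        double dhry_per_sec = 7.82e6;  /* e.g. ~7.8M Dhrystones/s      */
        double mhz          = 1200.0;  /* e.g. an H618 core at 1.2 GHz */
        double dmips        = dhry_per_sec / 1757.0;

        printf("%.0f DMIPS, %.2f DMIPS/MHz\n", dmips, dmips / mhz);
        return 0;
    }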

    It is hard to compare performance per watt: the Orange Pi Zero 3 has a low
    power draw (on the order of 100 mA from a 5V USB charger with one core
    active) and it is not clear how it is distributed between the CPUs and the
    Ethernet interface. The RISCV in the Milkv-Duo has an even lower power
    draw. OTOH desktop cores normally seem to run at a fraction of rated power
    too (but I have no way to directly measure CPU power draw).

    Of course, there is a catch: the desktop CPU is made on a more advanced
    process than the small processors. So it is hard to separate the effects
    of the architecture from those of the process.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Sep 20 19:32:17 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,

    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.

    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered as RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    Maybe relevant:

    Performance optimizers writing asm regularly hit that 1 IPC on the 486
    and (with more difficulty) 2 IPC on the Pentium.

    When we did get there, the final performance was typically 3X compiled C
    code.

    That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
    the PPro and later OoO CPUs.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 20 17:38:19 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,

    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.
    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many
    successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered as RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    Maybe relevant:

    Performance optimizers writing asm regularly hit that 1 IPC on the 486
    and (with more difficulty) 2 IPC on the Pentium.

    When we did get there, the final performance was typically 3X compiled C code.

    That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
    the PPro and later OoO CPUs.

    And then came back with SIMD, I presume? :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Sep 20 22:01:27 2025
    From Newsgroup: comp.arch

    On Sat, 20 Sep 2025 07:56:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Yes, organizing the interconnect in a hierarchical way can help reduce
    the increase in interconnect cost, but I expect that there is a reason
    why Intel did not do that for its server CPUs with P-Cores, by e.g.,
    forming clusters of 4, and then continuing with the ring; instead,
    they opted for a grid interconnect.

    - anton


    I don't know for sure, but I would imagine that the reason is that their
    server CPUs with P-cores have the same design for low-to-mid-end "cloud"
    models and for high-end "enterprise" models. High-end models have OLTP
    and similar enterprise workloads as a rather important market. A flatter
    LLC is better for OLTP/enterprise than a dozen or two separate L3
    caches. Besides, their current L2 caches are rather big, so if they
    made those separate L3s truly exclusive, which is optimal for reducing
    cc traffic, then there would be a rather big waste of total cache
    capacity.

    An alternative is to leave the LLC intact and instead make the L2s shared
    by pairs of cores. That is unacceptable because of yet another market
    addressed by the same Xeon line - computation/HPC, where being
    limited by L2 bandwidth is not rare even now. With a shared L2 it would
    become very common.

    3 different uncore designs for 3 different markets could solve that
    nicely, but of course in Intel's current financial situation that
    is unthinkable. Probably even the current arrangement with 3 Xeon lines
    (Xeon-E = desktop chips with E-cores fused off, Sierra Forest = plenty
    of Crestmont cores, and "normal" Xeons currently represented by Granite
    Rapids) could be unsustainable.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Sep 20 21:14:23 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,
    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent
    machines view, and from Clocks Per Instruction to Instructions Per Clock.
    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX, 386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many
    successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered as RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    Maybe relevant:

    Performance optimizers writing asm regularly hit that 1 IPC on the 486
    and (with more difficulty) 2 IPC on the Pentium.

    When we did get there, the final performance was typically 3X compiled C
    code.

    That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
    the PPro and later OoO CPUs.

    And then came back with SIMD, I presume? :-)

    Sure!

    I typically got 3X SIMD speedup from 4-way processing, years before any compilers were able to autovectorize to again partly close the gap.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Sep 20 16:10:40 2025
    From Newsgroup: comp.arch

    On 9/20/2025 8:10 AM, Waldek Hebisch wrote:
    BGB <cr88192@gmail.com> wrote:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with
    Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.

    Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks to 4453.45 DMIPS (dhrystone MIPS). Single core in my desktop bencharks to about 50000 DMIPS. Dhrystone contain string operations which benefit
    from SSE/AVX, but I would expect that on media load speed ratio would
    be even more favourable to desktop core. On jumpy code ratio is probably lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

    It is hard to compare performance per watt: Orange Pi Zero 3 has low
    power draw (of order 100 mA from 5V USB charger with one core active) and
    it is not clear how it is distributed between CPU-s and Etherent interface. RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
    normally seem to run at at fraction of rated power too (but I have
    no way to directly measure CPU power draw).

    Of course, there is a catch: desktop CPU is made on more advanced
    process than small processors. So it is hard to separate effects
    from architecture and from the process.


    I had noted before that when I compiled Dhrystone on my Ryzen using
    MSVC, it is around 10M, or 5691 DMIPS, or around 1.53 DMIPS/MHz.

    Curiously, the score is around 4x higher (around 40M) if Dhrystone is
    compiled with GCC (and around 2.5x with Clang).

    For most other things, the performance scores seem closer.

    I don't really trust GCC's and Clang's Dhrystone scores as they seem
    basically out-of-line with most other things I can measure.



    Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
    If assuming MSVC as the reference, this would imply (after normalizing
    for clock-speeds) that the Ryzen only gets around 50% more IPC.



    I noted when compiling my BJX2 emulator:
    My Ryzen can emulate it at roughly 70MHz;
    My cell-phone can manage it at roughly 30MHz.

    This isn't *that* much larger than the difference in CPU clock speeds.


    It is like, I seemingly live in a world where a lot of my own benchmark attempts tend to be largely correlated with the relative difference in
    clock speeds and similar.


    Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
    to perform somewhat below that curve.


    Though, in the case of the laptop, it may be a case of not getting all
    that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
    speeds, and on that laptop, its memcpy speeds are kinda crap if compared
    with CPU clock-speed).

    Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
    Thing ran Quake and Quake 2 pretty OK, but not much else.


    Though, if running my emulator on the laptop, it is more back on the
    curve of relative clock-speed, rather than on the
    relative-memory-bandwidth curve.

    It seems both my neural-net stuff and most of my data compression stuff,
    more follow the memory bandwidth curve (though, for the laptop, it seems
    NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).


    Well, and then my BJX2 core seems to punch slightly outside its weight
    class (MHz wise) by having disproportionately high memory bandwidth.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Sep 20 22:01:48 2025
    From Newsgroup: comp.arch

    On 9/20/2025 6:48 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that
    direction.

    What direction?


    In some direction where emulating x86 on in-order cores was preferable to having x86 in hardware...

    May or may not be "extreme budget".



    Though, I am writing this after having to battle for a while to get
    "boot magic" out of a Dell OptiPlex that I got on Amazon for $80.

    Turned out the UEFI BIOS was not installed correctly on the PC, which
    was effectively "utterly helpless" without it.

    Had to use a Dell tool to make an installer image on a USB thumb-drive,
    to get a bootable BIOS to configure the thing into a form where it could actually boot (where, it could then apparently install the BIOS config
    UI from the USB drive). Apparently no support for Legacy Boot; the option
    was listed but just sort of grayed out and could not be selected
    (apparently no TPM either, so can't run Win 11).

    But, for $80, could get something with a Core i3 and a 500GB HDD.
    Case was only really designed to handle a 2.5" drive, no space to fit a
    3.5" HDD.


    Going much cheaper, it apparently crosses from HDD into "eMMC Flash" territory, but "64GB eMMC Flash" was maybe a little too budget.

    There were also some options with M.2, but I wanted SATA. At least, in
    theory, with SATA one can swap HDDs if needed, but this is seemingly
    hindered if the firmware is so limited as to be rendered helpless if it
    can't load it from the HDD.


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance >>>> on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...

    That's probably a software problem. Different Eee PC models have
    different CPUs, Celeron M @571Mhz, 900MHz, or 630MHz, Atoms with 1330-1860Mhz, or AMD C-50 or E350. All of them are quite a bit faster
    than the 700Mhz ARM11. While I don't have a Raspi1 result on https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
    result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
    faster than the 700Mhz ARM11), and also some CPUs similar to those
    used in the Eee PC; numbers are times in seconds:

    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    - Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89
    - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
    - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216

    So all of these CPUs clearly beat the one in the Raspi3, which I
    expect to be clearly faster than the ARM11.


    IIRC, I was running Debian on the Eee (IIRC because the Xandros it came
    with was kinda useless).

    The one I have is one of the 701 variants (would need to find it
    again to know the model). Looking online, it was probably one of the underclocked Celeron models though.


    Not sure how fast (or not fast) it was, but it was basically about
    enough to run Quake and Quake 2 in 640x480, but was hard pressed to do
    much more than this (and be playable).

    Trying to use Firefox or similar on it was just kinda painful.


    Now imagine running the software that made the Eee PC so slow with
    dynamic translation on a Raspi1. How slow would that be?


    Seemingly the RasPi could run Quake OK in 800x600 though...
    And, also did well working with CRAM video.


    By other subjective measures, at least the GUI on the RasPi didn't
    behave like molasses.

    So, in any case, a better user experience at least (with some
    uncertainty as to the actual speed).


    Granted, it might have been useful to time GCC builds or similar
    for a more objective measure; I would need to find both.


    Though, at least, an emulator would need to be faster than DOSBox, as
    DOSBox on RasPi tends to be too slow to even really run Doom or similar.

    My cellphone at least gave a slightly better experience running DOSBox
    (well, except that DOSBox and Termux on Android occasionally forget all
    of their local storage and get reverted to their default contents).

    RasPi+DOSBox can at least seemingly run Windows 3.11 and similar though.

    Though, AFAIK DOSBox on ARM is running purely as an interpreter.


    I remember though that one time I did try doing custom code generation
    on the RasPi, and performance was terrible. At the time it seemed like
    there was some "secret sauce" that GCC had to not get terrible performance.

    Though, IIRC, this was a fork where I had tried to modify BGBCC's
    SuperH backend to be able to target Thumb2.



    Or, seeming informal/subjective ranking (mostly from memory):

    Eee (CPU = something slow):
    Quake 2, 640x400, OK-ish
    Quake 3, N/A, didn't work
    (No memcpy score or formal benchmarks)

    Laptop from 2003 (1.4GHz Athlon, of some variant):
    Quake 1/2: 1024x768, runs well.
    (1024x768 is max resolution of LCD).
    Quake 3: Also runs well.
    As did GLQuake and Quake2 in OpenGL.
    Half-Life runs well.
    Half-Life 2, ran but poorly.
    Gets around 400MB/s in a memcpy benchmark.
    DDR1 100 MHz (or, DDR-200)
    Notably lower than theoretical bandwidth.
    (No values for LZ4 or CRAM tests IIRC)

    RasPi 1 (700 MHz ARM11):
    Quake 800x600 runs OK.
    Quake 3: Ran, but poorly.
    Gets around 1.2 GB/sec in memcpy.
    Around 300 MB/s LZ4 decode
    Around 400 Mpix/sec in CRAM decode.

    RasPi 3 (1400 MHz 4x A53):
    Quake 1/2/3 and GLQuake and Q3A run well.
    Gets around 1.6 GB/sec in memcpy.
    Around 500 MB/s LZ4 decode
    Around 700 Mpix/sec in CRAM decode.

    Laptop from 2009 (2.1 GHz Core 2, 2 cores):
    Quake 1/2 and Half-Life are 60 fps at max resolution (1440x900).
    In SW rendering only.
    It was a very good option if you were OK with software rendering.
    Quake 3: Around 20 fps.
    GLQuake and Quake3 perform like dog crap.
    GPU: Intel GMA X3100
    Half-Life 2: Also very poor.
    Minecraft ran, but unplayable.
    Even on lowest draw distance.
    Doom 3, started up at least...
    Severe graphical glitches (lighting didn't work correctly)
    Dead slow.
    Around 2.4 GB/sec in memcpy.
    Around 2.0 GB/s in LZ4
    Around 1500 Mpix/sec in CRAM decode.
    Performs well in CPU based tasks.
    OpenGL via Software rasterization almost as fast as the GPU.

    Current PC (Ryzen 2700X, 3.7GHz, 8C16T)
    No issues running any of these games.
    Memcpy: 3.6 GB/sec.
    DDR4-2133
    Around 3.2 GB/sec in LZ4
    Around 2000 Mpix/sec in CRAM decode.


    As can be noted:
    memcpy tests tend to measure lower than RAM bandwidth.
    CRAM decode often tends to exceed memcpy.
    My memcpy and LZ4 tests are single threaded.
    Multi-threading can often give higher total bandwidth.
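    For reference, a minimal single-threaded memcpy bandwidth test in the same
    spirit (the buffer size, repeat count, and use of clock_gettime() here are
    assumptions, not the actual test behind the numbers above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        size_t size = 64u << 20;            /* 64 MiB, larger than the LLC */
        int    reps = 16;
        char  *src  = malloc(size);
        char  *dst  = malloc(size);
        if (!src || !dst) return 1;
        memset(src, 1, size);               /* touch the pages first       */
        memset(dst, 0, size);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++)
            memcpy(dst, src, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.2f GB/sec\n", (double)size * reps / sec / 1e9);
        free(src); free(dst);
        return 0;
    }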


    The bulk of time in CRAM decoding is spent in logic like:
    tab[0]=colorA;
    tab[1]=colorB;
    px0=tab[(pix>>0)&1]; px1=tab[(pix>>1)&1];
    px2=tab[(pix>>2)&1]; px3=tab[(pix>>3)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
    ct+=stride;
    px0=tab[(pix>>4)&1]; px1=tab[(pix>>5)&1];
    px2=tab[(pix>>6)&1]; px3=tab[(pix>>7)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
    ct+=stride;
    px0=tab[(pix>> 8)&1]; px1=tab[(pix>> 9)&1];
    px2=tab[(pix>>10)&1]; px3=tab[(pix>>11)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
    ct+=stride;
    px0=tab[(pix>>12)&1]; px1=tab[(pix>>13)&1];
    px2=tab[(pix>>14)&1]; px3=tab[(pix>>15)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
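    For context, a hedged sketch of how that inner logic fits into decoding one
    4x4 two-color block (MS Video-1 style); the function name, pixel type, and
    calling convention here are assumptions, not the actual decoder:

    #include <stdint.h>

    /* Decode one 4x4 block with a 16-bit pixel mask selecting between two
       colors; 'stride' is the destination row pitch in pixels. */
    static void cram_decode_block_2color(uint16_t *ct, int stride,
                                         uint16_t pix,
                                         uint16_t colorA, uint16_t colorB)
    {
        uint16_t tab[2] = { colorA, colorB };
        for (int row = 0; row < 4; row++) {
            ct[0] = tab[(pix >> (row * 4 + 0)) & 1];
            ct[1] = tab[(pix >> (row * 4 + 1)) & 1];
            ct[2] = tab[(pix >> (row * 4 + 2)) & 1];
            ct[3] = tab[(pix >> (row * 4 + 3)) & 1];
            ct += stride;
        }
    }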


    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Sep 21 16:20:00 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> wrote:
    On 9/20/2025 8:10 AM, Waldek Hebisch wrote:
    BGB <cr88192@gmail.com> wrote:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.

    Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks to 4453.45 DMIPS (Dhrystone MIPS). Single core in my desktop benchmarks to about 50000 DMIPS. Dhrystone contains string operations which benefit
    from SSE/AVX, but I would expect that on media load speed ratio would
    be even more favourable to desktop core. On jumpy code ratio is probably
    lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

    It is hard to compare performance per watt: Orange Pi Zero 3 has low
    power draw (of order 100 mA from 5V USB charger with one core active) and
    it is not clear how it is distributed between CPUs and Ethernet interface. RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
    normally seem to run at a fraction of rated power too (but I have
    no way to directly measure CPU power draw).

    Of course, there is a catch: desktop CPU is made on more advanced
    process than small processors. So it is hard to separate effects
    from architecture and from the process.


    I had noted before that when I compiled Dhrystone on my Ryzen using
    MSVC, it is around 10M, or 5691 DMIPs, or around 1.53 DMIPs/MHz.

    Curiously, the score is around 4x higher (around 40M) if Dhrystone is compiled with GCC (and around 2.5x with Clang).

    For most other things, the performance scores seem closer.

    I don't really trust GCC's and Clang's Dhrystone scores as they seem basically out-of-line with most other things I can measure.

    I would not totally dismiss Dhrystone scores. Apparently Dhrystone
    allows more optimizations than other programs. There may be bias,
    because GCC and Clang developers select optimizations to improve
    benchmark scores. But AFAICS the compiled code performs the work it should
    do. And the work corresponds to a typical work mix from the past.
    More importantly, optimizations in gcc are mostly independent of
    architecture, so essentially the same optimizations are applied
    on all machines.

    BTW: I get similar Dhrystone results from GCC and Clang (differences of
    a few percent or less).

    Concerning other loads, my current desktop (12 cores) builds a medium-size
    program about 8.5 times faster than a 4-core Core 2 from 2008. There
    is a non-negligible serial part in the build, so a single modern core is
    about 3 times faster than a single core in the Core 2. I do not have
    comparable results for the 64-bit Orange Pi, but on slow machines I see
    build times that are 40 times longer. A big part of that is the number of
    cores; hyperthreading helps too (real time using 20 jobs is significantly
    smaller than real time using 12 jobs). But clearly a single big core is
    significantly faster than the smaller cores.

    Part of the advantage of the big core is due to big caches; my understanding
    is that the smaller processors that I use have much smaller caches.

    Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
    If assuming MSVC as the reference, this would imply (after normalizing
    for clock-speeds) that the Ryzen only gets around 50% more IPC.



    I noted when compiling my BJX2 emulator:
    My Ryzen can emulate it at roughly 70MHz;
    My cell-phone can manage it at roughly 30MHz.

    This isn't *that* much larger than the difference in CPU clock speeds.


    It is like, I seemingly live in a world where a lot of my own benchmark attempts tend to be largely correlated with the relative different in
    clock speeds and similar.

    Well, clock speed is a major factor for power efficiency. Running a CPU
    at a lower clock frequency significantly lowers energy per instruction.
    And the mere capability to run at a high clock frequency causes increased
    power use at lower clock frequencies (IIUC high frequency may need
    bigger transistors and/or more transistors).

    Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
    to perform somewhat below that curve.


    Though, in the case of the laptop, it may be a case of not getting all
    that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
    speeds, and on that laptop, its memcpy speeds are kinda crap if compared with CPU clock-speed).

    Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
    Thing ran Quake and Quake 2 pretty OK, but not much else.


    Though, if running the my emulator on the laptop, it is more back on the curve of relative clock-speed, rather than on the
    relative-memory-bandwidth curve.

    It seems both my neural-net stuff and most of my data compression stuff, more follow the memory bandwidth curve (though, for the laptop, it seems
    NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).


    Well, and then my BJX2 core seems to punch slightly outside its weight
    class (MHz wise) by having disproportionately high memory bandwidth.

    ...


    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Mon Sep 22 03:21:17 2025
    From Newsgroup: comp.arch

    In article <bp4jck19kcmq4i571fiofcrk1k6nn9k0ha@4ax.com>,
    George Neuner <gneuner2@comcast.net> wrote:
    On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <quadibloc@invalid.invalid> wrote:

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard

    Sounds more like dynamic micro-threading.

    Over the years I've seen a handful of papers about compile time micro-threading: that is the compiler itself identifies separable
    dependency chains in serial code and rewrites them into deliberate
    threaded code to be executed simultaneously.

    It is not easy to do under the best of circumstances and I've never
    seen anything about doing it dynamically at run time.

    To make a thread worth rehosting to another core, it would need to be
    (at least) many 10s of instructions in length. To figure this out dynamically at run time, it seems like you'd need the decode window to
    be 1000s of instructions and a LOT of "figure-it-out" circuitry.


    MMV, but to me it doesn't seem worth the effort.

    I began reading the patent, and it's not clear to me this approach is
    going to be much of an improvement. A great deal of analysis magic has
    to happen to find code to spread across the cores. To summarize, it's basically taking code that looks like:

    for(i = 0; i < N; i++) {
        // Do some work
    }

    for(i = 0; i < M; i++) {
        // Do some different work
    }

    and have two cores run the loops at the same time, with some special
    check hardware to make sure they really are independent (I gave up before
    really figuring out what they're going to do, patents are not fun to read).
    I think they actually want to divide up each loop into sections, and do
    them in parallel. If someone wanted to explain in better detail what
    they are doing, I'd like to read that short summary in non-patentese.
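    For illustration, a minimal software analogue of that split using C11
    threads; this only shows the intended parallelism, not Intel's actual
    mechanism or the dependence checking described in the patent, and the
    array sizes and work bodies are placeholders:

    #include <stdio.h>
    #include <threads.h>

    #define N 1000000
    #define M 1000000

    static double a[N], b[M];

    static int work_a(void *arg)              /* "Do some work"             */
    {
        (void)arg;
        for (int i = 0; i < N; i++)
            a[i] = i * 0.5;
        return 0;
    }

    int main(void)
    {
        thrd_t t;
        thrd_create(&t, work_a, NULL);        /* first loop on another core */
        for (int i = 0; i < M; i++)           /* "Do some different work"   */
            b[i] = i * 2.0;
        thrd_join(t, NULL);
        printf("%f %f\n", a[N - 1], b[M - 1]);
        return 0;
    }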

    A trivial alternative approach to shrinking core size while not losing
    single thread speed is to basically make all cores Narrow (meaning
    support something like 4 instructions wide), and when code needs more,
    stall the neighboring core and steal its functional units to form a new
    8-wide core. This approaches the SMT hardware sharing in a different direction, and so code without much instruction parallelism will run
    better on two smaller cores than on a big core with two threads, but if
    a single thread can use 8-wide instruction execution, it can steal it from
    the neighboring core for a while.

    If that's too much trouble, then for x86, all cores have just AVX-256 width, and take two clocks to do each AVX-512 operation (which is still better than just AVX-256). But hardware can join the neighboring cores together to be AVX-512, with each AVX-512 op taking just one clock now (and this can just
    be AVX, the other core can run other instructions unimpeded).

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Sep 22 11:28:13 2025
    From Newsgroup: comp.arch

    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Sep 22 20:28:33 2025
    From Newsgroup: comp.arch

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more efficient
    at general integer work and other common actions, as a result of a
    better designed instruction set and register set. But once you are
    using slightly more specific hardware features - vector processing,
    floating point, acceleration for cryptography, etc., it's all much the
    same. It takes roughly the same energy to do these things regardless of
    the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
    on a chip.

    So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go
    smaller - small embedded systems - x86 is totally non-existent because
    an x86 microcontroller would be an order of magnitude bigger, more
    expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Sep 22 19:36:05 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more efficient
    at general integer work and other common actions, as a result of a
    better designed instruction set and register set. But once you are
    using slightly more specific hardware features - vector processing,
    floating point, acceleration for cryptography, etc., it's all much the
    same. It takes roughly the same energy to do these things regardless of
    the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
    on a chip.

    So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go smaller - small embedded systems - x86 is totally non-existent because
    an x86 microcontroller would be an order of magnitude bigger, more
    expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs, and
    NICs} and DRAM {plus power supplies and cooling} as in the cores.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Sep 23 08:24:54 2025
    From Newsgroup: comp.arch

    On 22/09/2025 21:36, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more efficient
    at general integer work and other common actions, as a result of a
    better designed instruction set and register set. But once you are
    using slightly more specific hardware features - vector processing,
    floating point, acceleration for cryptography, etc., it's all much the
    same. It takes roughly the same energy to do these things regardless of
    the instruction set. Cache memory takes about the same power, as do PCI
    interfaces, memory interfaces, and everything else that takes up power
    on a chip.

    So when you have a relatively small device - such as what you need for a
    mobile phone - the instruction set and architecture makes a significant
    difference and ARM is a lot more power-efficient than x86. (If you go
    smaller - small embedded systems - x86 is totally non-existent because
    an x86 microcontroller would be an order of magnitude bigger, more
    expensive and power-consuming than an ARM core.) But when you have big
    processors for servers, and are using a significant fraction of the
    processor's computing power, the details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs, and NICs} and DRAM {plus power supplies and cooling} than in the cores.

    Yes, all that will be independent of the type of cpu core.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 21:08:10 2025
    From Newsgroup: comp.arch

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power
    when emulating x86 than a typical Intel or AMD CPU, even if
    slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most
    of them run software that's just as easily available for ARM.
    And many datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more
    efficient at general integer work and other common actions, as a
    result of a better designed instruction set and register set. But
    once you are using slightly more specific hardware features -
    vector processing, floating point, acceleration for cryptography,
    etc., it's all much the same. It takes roughly the same energy to
    do these things regardless of the instruction set. Cache memory
    takes about the same power, as do PCI interfaces, memory
    interfaces, and everything else that takes up power on a chip.

    So when you have a relatively small device - such as what you need
    for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
    x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of
    magnitude bigger, more expensive and power-consuming than an ARM
    core.) But when you have big processors for servers, and are using
    a significant fraction of the processor's computing power, the
    details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.
    Spec.org has a special benchmark for that called SPECpower_ssj2008.
    It is old and Java-oriented but I don't think that it is useless.

    Right now the benchmark clearly shows that AMD offerings dominate
    Intel's.
    The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html


    The best Intel scores are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores;
    they barely beat some old EPYC3 scores from 2021. https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
    https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html


    There are very few non-x86 submissions. The only one that I found in the
    last 5 years was using the Nvidia Grace CPU Superchip based on Arm Inc.
    Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Wed Sep 24 15:56:37 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Sep 24 20:00:07 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power
    when emulating x86 than a typical Intel or AMD CPU, even if
    slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most
    of them run software that's just as easily available for ARM.
    And many datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of calculation you are doing. ARM cores can often be a lot more
    efficient at general integer work and other common actions, as a
    result of a better designed instruction set and register set. But
    once you are using slightly more specific hardware features -
    vector processing, floating point, acceleration for cryptography,
    etc., it's all much the same. It takes roughly the same energy to
    do these things regardless of the instruction set. Cache memory
    takes about the same power, as do PCI interfaces, memory
    interfaces, and everything else that takes up power on a chip.

    So when you have a relatively small device - such as what you need
    for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
    x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of magnitude bigger, more expensive and power-consuming than an ARM
    core.) But when you have big processors for servers, and are using
    a significant fraction of the processor's computing power, the
    details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.
    Spec.org has special benchmark for that called SPECpower_ssj 2008.
    It is old and java-oriented but I don't think that it is useless.

    Right now the benchmark clearly shows that AMD offferings dominate
    Intel's.
    The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html


    The best Intel score are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores,
    They barely beats some old EPYC3 scores from 2021. https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
    https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html


    There are very few non-x86 submissions. The only one that I found in
    last 5 years was using Nvidia Grace CPU Superchip based on Arm Inc.
    Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html


    A quick survey of the result database indicates only Oracle is
    sending results to the database.

    Would be interesting to see the Apple/ARM comparisons.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 23:37:17 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 20:00:07 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:


    A quick survey of the result database indicates only Oracle is
    sending results to the data base.


    You misread it.
    The organization that submits a result is listed as "Test Sponsor".
    Oracle is the sponsor of none of the results that I listed in my previous
    post.
    The sponsors are ASUSTeK Computer Inc, New H3C Technologies Co, Lenovo
    Global Technology and Infobell IT Solutions Pvt.

    The most recent submissions are by Dell and Lenovo. https://www.spec.org/power_ssj2008/results/res2025q3/

    Would be interesting to see the Apple/ARM comparisons.

    Would be very interesting, but not going to happen.
    The last time Apple submitted something to spec.org was almost 20 years ago.
    And it never submitted to Spec Power SSJ, which sort of makes sense -
    this is a benchmark designed for servers and Apple does not sell servers.

    The ARM architecture vendor with the highest number of submissions to
    spec.org is Ampere, but they abandoned Arm-designed cores a couple of
    years ago and are now shipping Arm architecture CPUs with cores of their
    own design.
    However there are a few results in the database that use their previous
    offerings based on Arm Neoverse-N1 cores. Here is the best result: https://www.spec.org/power_ssj2008/results/res2024q1/power_ssj2008-20231104-01332.html




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 23:48:50 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 15:56:37 -0400
    George Neuner <gneuner2@comcast.net> wrote:

    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs,
    SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
    the cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    I think that it's less than 80%. But it does not matter and does not
    change anything - power spent for cooling is approximately
    proportional to power spent for running.

    At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).


    I don't think that you have a scientific study to support your claims.

    That's before I state the obvious - even if you were correct about
    main RAM consuming more power than the CPU (which I doubt very much), still different CPUs can perform the same job with a very different number of
    main RAM accesses.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 24 21:04:03 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 24 Sep 2025 15:56:37 -0400
    George Neuner <gneuner2@comcast.net> wrote:

    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs,
    SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
    the cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    I think that it's less than 80%. But it does not matter and does not
    change anything - power spent for coooling is approximately
    proportional to power spent for runninng.

    At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    A typical 16GB DIMM module will dissipate 3-5 watts. So 128GB will
    draw in the vicinity of 32 watts. The TDP for a high-end
    Xeon may exceed 350 watts; Diamond Rapids may exceed 500 watts.
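    A quick arithmetic sketch in Python, treating the figures above (3-5 W
    per 16GB DIMM, a 350 W CPU TDP) as assumptions, just to put the two in
    proportion:

        # back-of-envelope: DRAM power vs one high-end CPU
        dimm_w_low, dimm_w_high = 3.0, 5.0
        dimms = 128 // 16                      # eight 16GB modules
        ram_low, ram_high = dimms * dimm_w_low, dimms * dimm_w_high
        cpu_tdp = 350.0                        # high-end Xeon class
        print(f"DRAM: {ram_low:.0f}-{ram_high:.0f} W vs CPU TDP {cpu_tdp:.0f} W")
        print(f"DRAM worst-case share of DRAM+CPU: {ram_high/(ram_high+cpu_tdp):.0%}")
        # -> DRAM: 24-40 W vs CPU TDP 350 W; roughly 10% worst case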
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 25 00:21:02 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 21:04:03 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    <...>

    Scott,
    When you answer George Neuner's point, can you, please, reply to George Neuner's post rather than to mine?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Sep 24 21:27:09 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 24 Sep 2025 15:56:37 -0400
    George Neuner <gneuner2@comcast.net> wrote:
    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    In the old days, I heard that they used about as much power for
    cooling as goes into the machines. In recent times, I have heard
    about success stories where they use less. <https://en.wikipedia.org/wiki/Coefficient_of_performance> says: "Most
    air conditioners have a COP of 3.5 to 5", i.e., quite a bit less
    energy is expended on cooling than is moved away.
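    A minimal sketch in Python of what a given COP implies for the cooling
    share, assuming essentially all IT power ends up as heat that has to be
    moved out:

        # COP = heat moved / electrical work done by the cooling plant
        def cooling_share(cop):
            it_power = 1.0                  # normalize the IT load to 1
            cooling_power = it_power / cop  # work needed to move that heat
            return cooling_power / (it_power + cooling_power)

        for cop in (3.5, 5.0):
            print(f"COP {cop}: cooling is {cooling_share(cop):.0%} of total power")
        # -> about 22% at COP 3.5 and 17% at COP 5 - nowhere near 80%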

    At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    Where do you get this from?

    A typical 16GB dimm module will dissipate 3-5 watts. So 128GB will
    draw in the vincinity of 32 watts.

    We have several machines with 128GB RAM. They idle at around 40W, and
    a box with less RAM and otherwise the same hardware does not idle at
    much lower power consumption. The RAM has no active cooler, no
    passive cooler, and the modules sit close to each other, so they cannot
    dissipate lots of power, certainly not 32W.

    By contrast, the CPUs on these machines have elaborate active cooling solutions, and consume 105W TDP (142W power limit).

    SSDs are also unlikely to be consuming a lot of power, given the kind
    of cooling that they get. Yes, there are elaborate coolers for
    M.2-format SSDs, but that is not the kind of format that the bigger
    servers use (which rather use U.2 or U.3 SSDs), and even with M.2,
    there is usually no need to use SSD cooling.

    Maybe if you have a huge number of SSDs, power consumption may rival
    that of the CPU.

    - antn
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Sep 24 18:38:06 2025
    From Newsgroup: comp.arch

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    Is it really *that* inefficient? Sounds even more horrible than what
    I'd expect. Do you have some reference?

    At the same time, most of the heat generated by typical systems is due
    to the RAM - not the CPU(s).

    Even if we consider "CPUs", their power consumption can go much further
    than just that of the cores. I remember reading about Threadripper
    spending about half its power in its interconnect.
    Still, I suspect you need a lot of RAM before it starts consuming more
    power than your CPUs (at least the kind of RAM you find in gaming
    desktops consumes significantly less than the CPU, last I checked), so it
    likely depends on the workloads that are targeted.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 14:23:04 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 24 Sep 2025 21:04:03 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    <...>

    Scott,
    When you answer George Neuner's point, can you, please, reply to George Neuner's post rather than to mine?


    The attributions are there, as are the appropriate indentation markers ('>').

    Once I've read an article and restarted my newsreader, I don't have access
    to read articles (at least not easily).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 25 17:49:13 2025
    From Newsgroup: comp.arch

    On Thu, 25 Sep 2025 14:23:04 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Once I've read an article and restarted my newsreader, I don't have
    access to read articles (at least not easily).

    Doesn't it suck?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (M. Anton Ertl) to comp.arch on Thu Sep 25 15:28:56 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Once I've read an article and restarted my newsreader, I don't have access
    to read articles (at least not easily).

    I press the "Goto parent" button, and I think that already existed in
    xrn-9.03, which you use; maybe you need to configure it, or use the
    shortcut if one exists. The only problem is that if the parent is
    read, but an ancestor article is unread, it will skip the parent and
    go to that ancestor. If I ever find the time, I will fix that and
    send a patch to Jonathan Kamens.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 15:37:49 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 25 Sep 2025 14:23:04 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Once I've read an article and restarted my newsreader, I don't have
    access to read articles (at least not easily).

    Doesn't it suck?

    Not really. I've been using the same client since 1989; I'm used to it.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 15:41:30 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (M. Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Once I've read an article and restarted my newsreader, I don't have access to read articles (at least not easily).

    I press the "Goto parent" button, and I think that already existed in >xrn-9.03,

    yes, it has always existed, and yes, I can use it, but it is quite
    slow over NNTP. As the quoting is always accurate,
    I generally don't feel it is necessary in the case that Michael
    complained about.

    I can also hand-edit ~/.newsrc to see older articles, but seldom
    have the need.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Sep 25 23:16:00 2025
    From Newsgroup: comp.arch

    George Neuner wrote:
    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    I am quite sure that number is simply bogus: The power factors we were
    quoted when building the largest new datacenter in Norway 10+ years ago
    were more like 6-10% of total power for cooling, AFAIR.

    .. a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.
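    A minimal sketch in Python of what that figure means, since PUE is
    defined as total facility power divided by IT equipment power:

        # overhead (cooling, power conversion, lighting) as a share of total
        def overhead_share(pue):
            return (pue - 1.0) / pue

        print(f"PUE 1.07 -> overhead is {overhead_share(1.07):.1%} of total power")
        print(f"PUE 2.00 -> overhead is {overhead_share(2.00):.1%} of total power")
        # -> about 6.5% of the total at PUE 1.07, vs 50% at an old PUE of 2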

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 25 23:48:19 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    George Neuner wrote:
    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).

    I am quite sure that number is simply bogus: The power factors we were quoted when building the largest new datacenter in Norway 10+ years ago,
    was more like 6-10% of total power for cooling afair.

    . a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    All of this depends on where the "cold sink" is!! and how cold it is.

    Pumping 6°C sea water through water-to-air heat exchangers is a lot
    more power efficient than using Freon and dumping the heat into 37°C
    air.

    I still suspect that rectifying and delivering clean (low-noise) DC
    to the chassis takes a lot more energy than taking the resulting heat
    away.

    Flash will have low heat signature
    DRAM will have significant heat signature
    DISKs will have significant heat signature
    GPUs will have significant heat signature
    CPUs will have significant heat signature
    Motherboard has low-medium heat signature


    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 02:03:21 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:


    I am quite sure that number is simply bogus: The power factors we were
    quoted when building the largest new datacenter in Norway 10+ years ago,
    was more like 6-10% of total power for cooling afair.

    . a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    All of this depends on where the "cold sink" is !! and how cold it is.

    Pumping 6ºC sea water through water to air heat exchangers is a lot
    more power efficient than using FREON and dumping the heat into 37ºC
    air.

    I still suspect that rectifying and delivering clean (low noise) D/C
    to the chassis' takes a lot more energy that taking the resulting heat
    away.

    The FB article above describes how they reduced the
    losses due to voltage changes as well as rectification.

    Consider that there are losses converting from the
    primary (e.g. 22kV) to 480V (2%), and additional losses
    converting to 208V (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.
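    Chaining those stage losses together, treating the quoted percentages
    as assumptions, gives a feel for the totals (a quick Python sketch):

        # multiply stage efficiencies, report the combined loss
        def total_loss(stage_losses):
            eff = 1.0
            for loss in stage_losses:
                eff *= (1.0 - loss)
            return 1.0 - eff

        best  = total_loss([0.02, 0.03, 0.06])   # 2% + 3% + 6% rectification
        worst = total_loss([0.02, 0.03, 0.12])   # 2% + 3% + 12% rectification
        print(f"conventional chain: {best:.1%} to {worst:.1%} lost")
        # -> roughly 10.6% to 16.3%, versus the ~7.5% optimized figure above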

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 25 23:30:27 2025
    From Newsgroup: comp.arch

    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:


    I am quite sure that number is simply bogus: The power factors we were
    quoted when building the largest new datacenter in Norway 10+ years ago, was more like 6-10% of total power for cooling afair.

    . a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    All of this depends on where the "cold sink" is !! and how cold it is.

    Pumping 6ºC sea water through water to air heat exchangers is a lot
    more power efficient than using FREON and dumping the heat into 37ºC
    air.

    I still suspect that rectifying and delivering clean (low noise) D/C
    to the chassis' takes a lot more energy that taking the resulting heat
    away.

    The FB article above describes how they reduced the
    losses due to voltage changes as well as rectification.

    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.


    What if, as opposed to each computer using its own power supply (from 120
    or 240 VAC), it used a buck converter, say, 960VDC -> 12VDC?

    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second-stage drop could use slightly cheaper transistors, while
    still limiting electrical losses due to wire resistance and still
    avoiding losses due to transformers and rectifiers.

    To balance cost and efficiency, could use, say, 8 or 10AWG CCA (copper
    clad aluminum) vs 10 or 12AWG copper. Could run the wires at a
    relatively lower amperage rating, say:
    8A over 10AWG CCA
    16A over 8AWG CCA
    Or, roughly 1/3 nominal.


    Where, CCA wire is a lot cheaper than copper wire, so it is easier to
    justify using absurdly thick wire here.

    Contrast that with, say, running 8A over 20AWG, which works, but a fair
    bit more is lost as heat. The alternative could be to run the
    power over parallel thinner wires rather than a single thicker wire. For
    example, replacing each 10AWG wire with four 14AWG wires.

    8A at 192V being 1.5kW, and 8A at 960V being 7.7kW.
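    A rough Python sketch of the resistive-loss side of this, using assumed
    values (about 3.3 milliohm per metre for 10AWG copper, CCA taken as
    roughly 1.5x the resistance of copper, and a 20 m one-way run):

        R_10AWG_CU = 0.00328           # ohm/m, 10AWG copper (approx.)
        CCA_FACTOR = 1.5               # assumed resistance penalty for CCA
        run_m, amps = 20.0, 8.0        # assumed run length and current

        r_loop = 2 * run_m * R_10AWG_CU * CCA_FACTOR   # out and back
        p_loss = amps ** 2 * r_loop                    # I^2 * R
        for volts in (192.0, 960.0):
            p_load = amps * volts
            print(f"{volts:.0f}V: {p_load/1000:.1f}kW delivered, "
                  f"{p_loss:.1f}W lost in the wire ({p_loss/p_load:.2%})")
        # same current, same wire -> same watts lost; the higher voltage just
        # makes that loss a smaller fraction of the power delivered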


    Though, assuming a series of 16 racks running on each shared 960V bus,
    this would be 128A. The above de-rating scheme would likely make normal
    CCA wire impractical. Probably could distribute DC power over a pair of
    1.25" aluminum bars or 0.75" to 1.0" copper bars. Likely, the 1.25"
    aluminum bar being the cheaper option here.

    Could maybe then connect each 10AWG wire to the bars using a clamp,
    and/or use an intermediate socket or modular connector.

    Does kinda seem a bit overkill though.


    Main power distribution would likely need to operate at a higher
    voltage, otherwise the building-scale power rails would be absurd here.

    Say, if one assumes a monolithic 960VDC system, and 16 rows, this is
    2048A. Like, what does one do here, 3" copper or 5" aluminum rails?... Probably no.

    Well, or maybe get creative and use large aluminum I-beams that serve
    both as power distribution and joists (so, all this metal can serve
    additional purpose). Though, 960V through the joists seems like a
    building maintenance hazard. Say, for example, 0V through the
    floor and 960V through the ceiling.


    Input power would likely need multiple transformers and rectifiers to be practical; though admittedly I have little idea here what sorts of
    diodes would be used in these rectifiers. Seems like each diode would
    itself need to be stupidly large to deal with this crap.


    As for cooling, could maybe either use liquid cooling, or hybrid
    air/liquid (say, with superchilled liquid pumped through radiators, and
    then fans circulating air through these radiators).

    To move lots of heat, could maybe use -90C ethanol as a coolant. Where
    ethanol can be pumped like water, but could be nearly as cold as Freon.
    Would likely still need big refrigeration pumps.

    If one could have an artificial lake outside (preferably with a
    sun-blocking cover), this could be used as a heat-sink.

    Where, say:
    Inner loop uses cold ethanol;
    Refrigeration system moves heat from ethanol loop to a water loop;
    The water loop pumps to/from an artificial lake used as a heat sink.
    If the lake is above ambient, it will dissipate heat, but if too much
    higher it would suffer evaporation losses.

    One idea here could be to have 2 levels of cover over the lake:
    The lower one is a metal cover painted black on both sides, placed
    roughly 20 inches over the surface of the water;
    The second cover is another 20 inches higher, painted black on the lower
    side and white on the upper side;
    The lower cover has a blocking wall to limit how much water vapor
    escapes, whereas the upper barrier is open to the sides (allowing air to
    flow through).

    As the water evaporates, it moves heat into the barrier, which then
    radiates heat (as black-body radiation) where the water condenses and
    falls back into the lake;
    The upper barrier partly absorbs heat from the lower layer, and also
    serves to reflect the sun. Air-flow between the layers can be used to
    radiate heat.

    One other possibility being to have a tall tapered tube (narrower near
    the top) with an open top, with the coolant water in the bottom (with
    the tube serving to reduce evaporation loss, as water is more
    likely to re-condense on the walls and fall back down than to escape the
    top). Could likely be made out of steel or similar, maybe black inside,
    white outside. Then maybe could heat the coolant water to around 70 or 80C.

    While in theory, a giant radiator could work, a sufficiently large
    radiator would likely be impractically expensive.


    Well, don't know what people actually do, this is just what comes to
    mind at the moment.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 26 14:02:31 2025
    From Newsgroup: comp.arch

    On Thu, 25 Sep 2025 23:16:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    George Neuner wrote:
    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs,
    SSDs, and NICs} and DRAM {plus power supplies and cooling} than
    in the cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).

    I am quite sure that number is simply bogus: The power factors we
    were quoted when building the largest new datacenter in Norway 10+
    years ago, was more like 6-10% of total power for cooling afair.

    .. a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    Terje


    I think 1.07 is for 480VAC outside the data center building to 48VDC at
    the server power plug.
    It does not include losses within the server:
    - 48V to mostly 12V by the server's PSU
    - 12V to the whole zoo of low voltages by on-board DC-DC converters.






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 26 12:10:41 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,

    Transistors?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 26 16:32:59 2025
    From Newsgroup: comp.arch

    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were the most widespread motors by far up to 25-30 years ago. But my
    impression was that today various types of electric motors (DC, esp.
    brushless, AC sync, AC async) enjoy similar popularity.

    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    I was never in a big datacenter, but I have heard that they prefer DC.


    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC converters are made of FETs. FETs are
    transistors.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 14:28:02 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:



    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.


    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Fri Sep 26 07:37:59 2025
    From Newsgroup: comp.arch

    On 9/26/25 7:28 AM, Scott Lurndal wrote:

    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).


    Is it still -48V?
    Historically, Bell System plant voltage, supplied by batteries.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 15:07:40 2025
    From Newsgroup: comp.arch

    Al Kossow <aek@bitsavers.org> writes:
    On 9/26/25 7:28 AM, Scott Lurndal wrote:

    In those datacenters, the UPS distributes 48VDC to the rack components
    (computers, network switches, storage devices, etc).


    Is it still -48V?
    Historically, Bell System plant voltage, supplied by batteries.

    Yes. Using a positive-ground system reduced corrosion in buried
    cabling. While corrosion is not generally an issue for datacenters,
    they use the same PDUs that the telecom industry uses.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 12:58:43 2025
    From Newsgroup: comp.arch

    On 9/26/2025 9:28 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:



    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.


    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

    OK.

    I had thought they were usually 120VAC or 240VAC.

    At least, what rack-servers I had encountered were usually one of these (sometimes they had the little switch on the power-supply set to 240V
    even in the US).

    Then again, can also note that when setting up my milling machine,
    lathe, and plasma table, these were all using 240VAC for the power
    distribution to the various components. These were all Tormach machines
    though, so can't say for others.




    48VDC also makes sense, as it is common in other contexts. I sorta
    figured a higher voltage would have been used to reduce the wire
    thickness needed.

    Though, I don't actually know how real datacenters work here, just sort
    of coming up with something optimized for the target goals
    (powering all this stuff while minimizing electrical losses and cost).




    I did realize after posting that, if the main power rails were organized
    as a grid, the whole building could be done probably with 1.25" aluminum
    bars.

    Could power the grid of bars at each of the 4 corners, with maybe some
    central diagonal bars (which cross and intersect with the central part
    of the grid, and an additional square around the perimeter). Each corner supply could drive 512A, and with this layout, no bar or segment should
    exceed 128A.



    Assuming they were using 240VAC, seems like the typical housing setup
    (12AWG wire) would be woefully insufficient. Would either need to be
    heavily built up and/or use much heavier gauge wiring.

    Or also solid copper or aluminum bars. Not sure if I had heard of this,
    usual idea IIRC was that people always use wire for AC power, except
    that if pushing a continuous load of several hundred amps, wire seems
    less practical (would need to be very thick, hard to work with, and expensive).


    Granted, more likely they would run the cable closer to the rated values
    and accept more energy loss due to electrical resistance (since, yeah, a
    1.25" bar or similar for 128A is a little excessive).


    Though, it seems likely that in this case, solid metal bars might be
    cheaper than using a whole lot of heavy gauge wire. And, repurposing
    generic aluminum bar-stock might be the cheapest option here (with joins either as aluminum clamps or via welding).


    If operating closer to conventional electrical ratings, could drop to
    0.375" bars for 128A. Going much thinner, voltage drops and heat would
    become an issue.

    So, say:
    0.250" likely high resistive loss.
    0.375" roughly nominal.
    0.750" maybe sufficiently low resistance
    (could likely handle 500A before significant heat)
    1.250" maybe overkill


    Well, and could maybe put a plastic coating or similar on the bars to
    limit accidental short-circuits. Decided to leave out analysis, but the
    most likely option (to balance cost and effectiveness) would likely be a post-install application of acrylic paint (latex paint would be
    insufficient, epoxy likely too expensive, ...).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 15:23:38 2025
    From Newsgroup: comp.arch

    On 9/26/2025 8:32 AM, Michael S wrote:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread motors by far up to 25-30 years ago. But my imressioon was that today various type of electric motors (DC, esp. brushlees, AC sync, AC async) enjoy similar popularity.



    IIRC, reluctance motors are also popular here. They are sorta like BLDC,
    but cheaper due to not needing big magnets (though, BLDC motors can give
    more power in a physically smaller package if compared with reluctance
    motors; but reluctance motors are still more compact if compared with AC induction motors).


    Like BLDC, it is possible to run reluctance motors at an exact speed.

    This is unlike AC induction motors where, although speed can be adjusted
    with a VFD, it isn't particularly exact as it depends on the load on the
    motor and similar. Accurate speed control on an induction motor will
    still require using an encoder, but they are still not good for
    positional control (and the effective "holding torque" of an AC
    induction motor is very low).

    Where more accuracy is needed, something like a big BLDC or reluctance
    motor with a servo-drive might be used (typically with hall-effect
    sensors in the stator).

    Generally, these motors can't be driven open-loop, as they are prone to
    "drop out" at relatively little load in these cases.


    Technically, the stator construction for a reluctance motor can be
    nearly identical to an induction motor, the main differences are in the
    design of the rotor.

    Where, say, an induction motor typically has a hollow rotor consisting
    of layered steel plates with an embedded copper or aluminum "squirrel
    cage" (a ring of bars around the perimeter, all shorted together at the
    top and bottom).

    The reluctance motor can use a solid steel rotor, with gaps machined in
    to control where magnetic flux will go.

    A typical BLDC motor either has a ring of permanent magnets, or
    alternating poles (from the top/bottom) with a central ring magnet.



    I had before imagined it should be possible to make a hybrid of a
    reluctance and induction rotor for intermediate effects; partly by
    filling the gaps in the reluctance rotor with aluminum in place of air.
    This could still operate synchronously, but could have better torque
    under load and less issue with drop out. If it drops below synchronous
    speed, it would instead induce eddy currents in the aluminum parts of
    the rotor; rather than the air being "basically useless". However,
    aluminum would still behave more like air as far as the magnetic flux
    lines are concerned.


    Though, some commercial designs had instead gone the other way,
    hybridizing the reluctance rotor with a BLDC rotor, and using (cheaper) ceramic magnets in place of rare-earth magnets (as typical in a BLDC).

    One variant here resembling a reluctance motor with a split rotor, with
    the top/bottom rotated relative to each other, and a central ceramic
    ring magnet. Though, I think this pushes it more into the BLDC category.



    Also common, on the AC side, are 440 and 208 3-phase.
    Many traditional AC induction motors operate on 440VAC 3-phase.
    A lot of traditional industrial machines were also 440VAC.


    There is some stuff I saw about electrostatic motors gaining popularity
    in some areas, but these tend to operate at high voltages but very
    little amperage. They are comparably weak compared with magnetic motors,
    but can be more energy efficient.


    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    I never was in big datacenter, but heard that they prefer DC.


    DC -> DC allows higher conversion efficiency compared to AC.
    Higher voltage distribution also allows more efficiency.


    Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    A higher line frequency would increase the relative efficiency of
    electrical transformers. Higher voltage AC also has a higher conversion efficiency than lower voltage.
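    A minimal Python sketch of the size side of that trade-off, using the
    standard transformer EMF relation V_rms = 4.44 * f * N * A_core * B_max
    (the numbers below are illustrative assumptions, not a real design):

        # for a fixed voltage, turns count and flux density, the required
        # core cross-section scales as 1/f
        def core_area_m2(v_rms, freq_hz, turns, b_max):
            return v_rms / (4.44 * freq_hz * turns * b_max)

        V, N, B = 480.0, 100, 1.5
        for f in (60.0, 240.0, 400.0):
            print(f"{f:>5.0f} Hz: ~{core_area_m2(V, f, N, B) * 1e4:.0f} cm^2 core")
        # -> ~120 cm^2 at 60Hz, ~30 cm^2 at 240Hz, ~18 cm^2 at 400Hz, which
        #    is why 400Hz is popular where transformer weight matters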

    In theory, assuming the AC comes in at 60Hz, could have a sort of rotary converter to boost the line frequency (could have a vaguely similar construction to an AC motor, but where input power uses 6 coils, and the output side has 12 or 24 coils; likely also operating like a boost transformer).

    Not sure if anyone already builds this, or what the conversion efficiency
    of such a device would be. It would need a high conversion efficiency
    (otherwise it would not offset the losses in the smaller downstream
    transformers).

    Though, wouldn't really gain anything if just going directly to DC via
    bridge rectifiers (with no intermediate transformers), and then using
    DC-DC conversion.


    So, say 1320VAC 3-phase could likely be rectified into 960VDC, where,
    assuming the presence of big capacitors, the voltage would drop slightly
    in conversion due to phase ripple (the "peaks" getting flattened out).

    Or, in theory, I have little idea where people would get diodes and
    capacitors big enough for this. Presumably giant industrial-sized diodes
    and capacitors could exist though (well, and/or PCBs with craptons of
    smaller components).

    Then again, in a relative sense, boards with 1000s of diodes and
    capacitors wouldn't cost much relative to the cost of the building and servers.

    ...




    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.



    Yes, pretty much.

    MOSFET, diode (from ground), inductor, and a capacitor;
    Then you need a controller circuit to keep track of the voltage and
    adjust the duty cycle as needed to maintain the target voltage.

    The MOSFET lets power in, which goes through the coil and charges the
    capacitor (in parallel with the load). When the MOSFET turns off, the
    inductor voltage reverses and its current keeps flowing, pulled up
    through the diode from the ground rail.


    It is possible to use an op-amp for this (rather than a microcontroller),
    but an op-amp would generate very crude PWM, and thus be noisier.

    Possible noise reduction approaches:
    Big capacitor;
    Secondary inductor, diode, and capacitor.
    Assuming a constant load, a second inductor could smooth the PWM noise
    by maintaining closer to a constant current; but is more likely to see
    voltage ripples if there are sudden changes in the load (if compared
    with using a bigger capacitor).

    By comparison, a microcontroller can generate a higher-frequency PWM
    signal, and keep the initial noise lower.
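    A minimal Python sketch of the first-order buck relations for the
    continuous-conduction case (duty cycle D = Vout/Vin, inductor ripple
    dI = (Vin-Vout)*D/(L*f), output ripple dV ~= dI/(8*f*C)); the component
    values are illustrative assumptions, not a design:

        def buck(vin, vout, f_sw, L, C):
            d  = vout / vin                       # ideal duty cycle
            di = (vin - vout) * d / (L * f_sw)    # inductor ripple current
            dv = di / (8.0 * f_sw * C)            # output voltage ripple
            return d, di, dv

        d, di, dv = buck(vin=48.0, vout=12.0, f_sw=500e3, L=10e-6, C=100e-6)
        print(f"duty {d:.0%}, ripple {di:.2f} A, output ripple {dv*1e3:.1f} mV")
        # -> duty 25%, ripple 1.80 A, output ripple 4.5 mV; raising f_sw
        #    shrinks both ripple terms, which is the point about running the
        #    PWM faster than a crude op-amp scheme can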


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 26 23:35:52 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------------snip----------------------------------
    Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    The military routinely uses 400 Hz to reduce the weight of transformers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 19:37:34 2025
    From Newsgroup: comp.arch

    On 9/26/2025 6:35 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------------snip----------------------------------
    Higher voltage would be needed with DC vs AC, as DC is more subject to
    resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    The military routinely uses 400 Hz to reduce the weight of transformers.

    OK, so it makes sense then...

    I guessed 240Hz as it could likely be enough to usefully boost
    efficiency, but not so high as to cause significant leakage from the building's electrical system.

    Something like 400 or 480Hz should also work.


    Moving too far into kHz territory is likely to result in significant
    leakage.

    Though, looking into it, would likely have to get pretty high into the
    kHz range before a building's power distribution system starts radiating
    most of the power into the environment (with most of the sub-kHz
    territory likely being pretty safe here).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 27 08:14:11 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread motors by far up to 25-30 years ago. But my imressioon was that today various type of electric motors (DC, esp. brushlees, AC sync, AC async) enjoy similar popularity.

    I can only speak from personal experience about the industry I
    work in (chemical). People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed. There are no DC networks in chemical plants.

    If you have a high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    I never was in big datacenter, but heard that they prefer DC.

    Eventually, electronics requires DC. Of course, you can make
    an economic calculation of where you put your transformers and
    rectifiers, and where you want which voltage.

    An option which makes little sense is to have a rectifier which
    creates high-voltage DC, then distributes that, and to have
    an alternator at the other end to create AC which you can then
    transform down. It would be better to distribute AC and transform
    it down, saving two parts.



    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.

    I'm more used to thyristors in that role.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 13:27:02 2025
    From Newsgroup: comp.arch

    On 26/09/2025 14:10, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    True DC motors - with brushes - are rarely found outside of very small
    motors (where they are cheap and simple). But there are a wide variety
    of AC motors controlled in many different ways. Asynchronous AC motors
    are only one type. There are lots of other topologies for motors and
    their controllers, with different pros and cons and suitable applications.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    There are lots of advantages of distributing power as DC. Transformers
    are only a good choice at higher voltages - once you get to the levels
    that can be handled well by semiconductor switches, they are smaller and
    more efficient, and work best for DC-to-DC. 1200V switches are cheap
    and common now, though there are devices that handle a few thousand
    volts. Electric car charger standards are 400V and 800V, with some new
    ones at 1000V or up to 1500V.

    It makes sense to distribute locally at something like 48V or 60V DC.
    Connections are simpler, you can take it directly from a UPS, and the
    local conversion to low-voltage power lines is simpler than with 120V
    or 240V AC.

    So for a data centre, using perhaps 800V DC (taking advantage of the
    electric car industry standards) to the rack, then 48V DC to the devices
    on the rack would seem a good setup to me. DC also makes life much
    easier and more efficient when you have UPSs and battery backup -
    locally in a rack, or wider in the higher level supply.



    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,

    Transistors?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 13:52:23 2025
    From Newsgroup: comp.arch

    On 27/09/2025 10:14, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread motors by far up to 25-30 years ago. But my
    imressioon was that today various type of electric motors (DC, esp.
    brushlees, AC sync, AC async) enjoy similar popularity.

    I can only speak from poersonal experience about the industry I
    work in (chemical). People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed. There are no DC network in chemical plants.

    If you have high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    These are not "DC motors" in the traditional sense, like brushed DC
    motors. The motors you use in a car have (roughly) sine wave drive
    signals, generally 3 phases (but sometimes more). Even motors referred
    to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
    waveforms are more trapezoidal than sinusoidal.

    And whenever you have a frequency inverter, the input to the inverter
    is first rectified to DC, then new AC waveforms are generated using
    PWM-controlled semiconductor switches.

    Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
    brushed DC motors.

    Bigger brushed DC motors, as you say, used to be used in situations
    where you needed speed control and the alternative was AC motors driven
    at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies.
    And as you say, these were replaced by AC motors driven from frequency inverters. Asynchronous motors (or "induction motors") were popular at
    first, but are not common choices now for most use-cases because
    synchronous AC motors give better control and efficiencies. (There are,
    of course, many factors to consider - and sometimes asynchronous motors
    are still the best choice.)




    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.

    I'm more used to thyristors in that role.


    It's better, perhaps, to refer to "semiconductor switches" as a more
    general term.

    Thyristors are mostly outdated, and are only used now in very high power situations. Even then, they are not your granddad's thyristors, but
    have more control for switching off as well as switching on - perhaps
    even using light for the switching rather than electrical signals.
    (Those are particularly nice for megavolt DC lines.)

    You can happily switch multiple MW of power with a single IGBT module
    for a couple of thousand dollars. Or you can use SiC FETs for up to a
    few hundred kW but with much faster PWM frequencies and thus better
    control.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 27 12:38:14 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:

    And whenever you have a frequency inverter, the input to the frequency
    is first rectified to DC, then new AC waveforms are generated using PWM controlled semiconductor switches.

    If you have three phases (required for high-power industrial motors)
    I believe people use the three phases directly to convert from three
    phases to three phases.

    The resulting waveforms are not pretty, and contribute to the
    difficulty of measuring power input.

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 15:15:41 2025
    From Newsgroup: comp.arch

    On 27/09/2025 14:38, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    And whenever you have a frequency inverter, the input to the frequency
    is first rectified to DC, then new AC waveforms are generated using PWM
    controlled semiconductor switches.

    If you have three phases (required for high-power industrial motors)
    I believe people use the three phases directly to convert from three
    phases to three phases.

    The resulting waveforms are not pretty, and contribute to the
    difficulty of measuing power input.


    That used to be how it was done - using thyristors, and powering
    induction motors. But it is not how it has been done in new motors for
    a long time. (In industrial use, some motors can be very big, very
    expensive, and very difficult to replace - thus factories can have the
    same motors for decades, even though better and more efficient ones are available.)

    Using thyristors to regulate the power out from your three phase input
    is relatively simple, but as you say, the waveforms are not pretty.
    This leads to significant noise (electrical and audible), vibrations,
    torque ripple, and wear and tear on the motor. And it makes a mess of
    the input supply, giving harmonics and phase differences between the
    current and voltage input - which leads to significant losses in the
    power delivery. These losses are between the generation and the
    customer, meaning the electricity supplier sees it but the customer does
    not see it on their bill - thus electricity suppliers greatly dislike
    it. The effect is less with thyristors on three phase power than
    thyristors on two phase power, but it is still very bad.

    So these days, the AC power - two phase or three phase - is invariably converted to DC first, using power factor correction rectification (so
    that the instantaneous current draw is proportional to the voltage at
    the time, keeping current and voltage in phase and nicely sinusoidal).
    After that, the AC drive to the motor is generated using PWM signals -
    from perhaps 2 or 4 kHz for old IGBT systems to at least 20 kHz for
    newer systems (avoiding audible noise) or up to maybe 160 kHz using GaN
    or SiC FETs - higher frequencies mean smaller and lighter inductors and capacitors.

    These kinds of motor control are smaller, more efficient, and much more controllable than old thyristor-based drives.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Sep 27 14:56:06 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/26/2025 9:28 AM, Scott Lurndal wrote:


    In those datacenters, the UPS distributes 48VDC to the rack components
    (computers, network switches, storage devices, etc).



    48VDC also makes sense, as it is common in other contexts. I sorta
    figured a higher voltage would have been used to reduce the wire
    thickness needed.

    This is within a 19" rack.



    I did realize after posting that, if the main power rails were organized
    as a grid, the whole building could be done probably with 1.25" aluminum bars.

    The Burroughs V5x0 series ECL machines had Aluminum bus-bars.

    Spectacular failure mode when/if something conductive (screwdriver,
    wrench) was dropped across the hot and ground bars.


    Could power the grid of bars at each of the 4 corners, with maybe some
    central diagonal bars (which cross and intersect with the central part
    of the grid, and an additional square around the perimeter). Each corner
    supply could drive 512A, and with this layout, no bar or segment should
    exceed 128A.

    In the old mainframe days, there would be large bus-bars (in an enclosure) across the ceiling and plug-in tap boxes would drop power to the
    various mainframe units.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Sep 27 14:56:47 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:
    --------------------snip----------------------------------
    Higher voltage would be needed with DC vs AC, as DC is more subject to
    resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    The military routinely uses 400 Hz to reduce the weight of transformers.

    IBM mainframes used 400 Hz (via a motor-generator set).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Sat Sep 27 08:57:44 2025
    From Newsgroup: comp.arch

    On 9/26/25 5:37 PM, BGB wrote:

    Something like 400 or 480Hz should also work.

    Would y'all please change the subject line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Sep 27 14:23:22 2025
    From Newsgroup: comp.arch

    On 9/27/2025 6:52 AM, David Brown wrote:
    On 27/09/2025 10:14, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage.  Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread  motors by far up to 25-30 years ago. But my imressioon was that today various type of electric motors (DC, esp.
    brushlees, AC sync, AC async) enjoy similar popularity.

    I can only speak from poersonal experience about the industry I
    work in (chemical).  People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed.  There are no DC network in chemical plants.

    If you have high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    These are not "DC motors" in the traditional sense, like brushed DC motors.  The motors you use in a car have (roughly) sine wave drive signals, generally 3 phases (but sometimes more).  Even motors referred
    to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
    waveforms are more trapezoidal than sinusoidal.


    Yes.

    Typically one needs to generate a 3-phase waveform at the speed they
    want to spin the motor at.



    I had noted from some experience writing code to spin motors (typically
    on an MSP430, mostly experimentally) or similar:
    Sine waves give low noise, but less power;
    Square waves are noisier and only work well at low RPM,
    but have higher torque.
    Sawtooth waves seem to work well at higher RPMs.
    Well, sorta, more like sawtooth with alternating sign.
    Square-root sine: intermediate between sine and square.
    Gives torque more like a square wave, but quieter.
    Trapezoid waves are similar to this, but more noise.

    Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
    the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
    and thus generate a lot of heat otherwise).

    The sawtooth wave helps here because the coil current can't change
    quickly, so one hits the coils with full power at the start (to get the
    current going), then rapidly drops back down to zero, then does the same
    on the opposite sign for the next part of the wave.
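
    As a rough sketch of the blending idea (purely illustrative C, not the
    actual MSP430 code; the blend/amplitude curves and table size are
    made-up assumptions):

      #include <math.h>
      #include <stdint.h>

      #ifndef M_PI
      #define M_PI 3.14159265358979323846
      #endif

      #define TABLE_LEN 256  /* PWM duty samples per electrical cycle (arbitrary) */

      /* Square-root sine: keeps the sign of sin(x), takes sqrt of the magnitude. */
      static double sqrt_sine(double x)
      {
          double s = sin(x);
          return (s >= 0.0) ? sqrt(s) : -sqrt(-s);
      }

      /* Alternating-sign sawtooth: full drive at the start of each half cycle,
         decaying to zero, then the same shape with the opposite sign. */
      static double alt_sawtooth(double x)
      {
          double t = fmod(x, 2.0 * M_PI) / (2.0 * M_PI);   /* 0..1 over one cycle */
          double u = (t < 0.5) ? (t * 2.0) : ((t - 0.5) * 2.0);
          return (t < 0.5) ? (1.0 - u) : -(1.0 - u);
      }

      /* Build an 8-bit duty table: morph from sqrt-sine toward sawtooth as the
         target RPM rises, and reduce amplitude at low RPM to limit heating. */
      void build_wave_table(uint8_t table[TABLE_LEN], double target_rpm, double max_rpm)
      {
          double frac      = target_rpm / max_rpm;    /* 0..1 */
          double blend     = frac;                    /* guessed morphing curve */
          double amplitude = 0.3 + 0.7 * frac;        /* guessed derating curve */

          for (int i = 0; i < TABLE_LEN; i++) {
              double x = 2.0 * M_PI * i / TABLE_LEN;
              double v = (1.0 - blend) * sqrt_sine(x) + blend * alt_sawtooth(x);
              table[i] = (uint8_t)((v * amplitude * 0.5 + 0.5) * 255.0);
          }
      }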


    When I was messing around with it at the time, input control signals
    were typically one of:
      ADC input connected to a POT (for direct control);
      1-2ms RC style PWM (a decoding sketch follows below).

    Step/Dir signaling (typical for stepper drivers and servomotor
    controllers) could also make sense.
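
    For the 1-2ms RC-style PWM input mentioned above, a minimal decoding
    sketch (generic hobby-RC convention, hypothetical function name; the
    pulse width itself would come from a timer-capture peripheral):

      #include <stdint.h>

      /* Map an RC servo pulse (nominally 1000..2000 us, 1500 us = neutral) to a
         signed command in -1000..+1000; pulses far outside the window are
         treated as "no signal".  Thresholds follow the usual hobby convention. */
      int16_t rc_pulse_to_command(uint16_t pulse_us, int *valid)
      {
          if (pulse_us < 800 || pulse_us > 2200) {
              *valid = 0;                 /* lost or implausible signal */
              return 0;
          }
          *valid = 1;
          int32_t cmd = ((int32_t)pulse_us - 1500) * 2;   /* +/-500 us -> +/-1000 */
          if (cmd >  1000) cmd =  1000;
          if (cmd < -1000) cmd = -1000;
          return (int16_t)cmd;
      }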

    One other option is dual-phase motors, which have the partial advantage
    that one can use a repurposed stepper driver (typically set to
    micro-stepping). A lot of the dual-phase motors in this case, though,
    were built from repurposed capacitor-run split-phase motors.

    Say, for example, one can be like, "Yeah, this AC split phase motor is
    close enough to being a NEMA34 stepper...".

    One typically needs to partly rewire it, as split-phase motors usually
    have 3 wires but 4 are needed in this case. Some other motors are easily
    modified into 3-phase though (having the same coils as a 3-phase motor
    internally, just wired into a split-phase configuration with a 60-degree
    phase offset, vs 90 degrees in some other motors).


    One can get different properties by machining a new rotor, as these
    motors invariably come with squirrel-cage rotors. The easiest/cheapest
    type to machine here is a reluctance rotor.

    Main annoyance mostly being that this can be a pretty big chunk of steel
    for any non-trivial motor (also heavy). Could likely reduce weight by
    making the base rotor by layering multiple sizes of steel tubing, then
    brazing or welding it all together with some steel end-caps (drilled out
    for the motor shaft, probably also brazed in place). Then turn it to the target diameter, and mill the side grooves.

    Well, or find something with sufficiently thick walls (say, a chunk of
    5" OD, schedule 120 or 180 steel pipe). This would simplify the process,
    and be cheaper (and lighter) than, say, a chunk of 5" bar stock.


    Haven't done much in this area for a while, was mostly messing around a
    lot more with this when I was a little younger.


    And whenever you have a frequency inverter, its input is first rectified
    to DC, then new AC waveforms are generated using PWM-controlled
    semiconductor switches.


    Yes:
      Dual-phase: may use a "Dual H-Bridge" configuration
        Where the H-bridges are built using power transistors;
      Three-phase: "Triple Half-Bridge"
        Needs fewer transistors than dual-phase (6 rather than 8).

    It is slightly easier to build these drivers with BJTs or Darlington
    transistors; these tend to handle less power and generate more heat, but
    are more fault tolerant.


    MOSFETs can handle more power, but one needs to be very careful not to
    exceed the Gate-Source voltage limit, otherwise they are insta-dead (and
    will behave as if they are shorted).

    So, one needs a more complex circuit, say:
    MOSFET power transistor (typically NMOS);
    NPN or PNP control transistor (such as a 2N3904 or similar);
    Pull up/down resistors;
    Zener diode.
    In which case the control transistors can be driven as in a typical
    H-Bridge.

    Say, for example:
    Pull down resistor pulls Gate to Source, keeping it off by default;
    Zener diode in parallel with resistor, to impose VGS limit;
    Pull-up transistor connects to Drain via a resistor
    (via emitter or collector, depending on PNP or NPN).
    Base on control transistor used for control.

    Then one can control the MOSFETs as if they were BJTs. Not sure why they
    can't have this stuff built in (sort of like with a Darlington), but alas.

    One typically also needs flyback diodes, and a main DC rail capacitor,
    and a DC rail zener diode, ...


    Though, at this stage, it is preferable to buy these things rather than
    build them, as the hand-built ones tend to have a bad habit of exploding.



    Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
    brushed DC motors.


    Pretty much.

    Brushed DC is more the motor technology one finds in toys and a lot of
    cordless power tools. Also the "Power Wheels" vehicles, which tended to
    use the same kind of 1/4 HP brushed DC motors often found in cordless
    power tools.

    Some adults have ridden around on these things, sometimes modding them
    to use bigger 1/2 or 3/4 HP motors. That typically also needs a bigger
    battery; the stock ones used repurposed UPS batteries (going from one I
    ended up tearing down some years ago). Otherwise, they are mostly all
    plastic apart from a steel axle and similar.


    Bigger brushed DC motors, as you say, used to be used in situations
    where you needed speed control and the alternative was AC motors driven
    at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies. And
    as you say, these were replaced by AC motors driven from frequency inverters.  Asynchronous motors (or "induction motors") were popular at first, but are not common choices now for most use-cases because
    synchronous AC motors give better control and efficiencies.  (There are,
    of course, many factors to consider - and sometimes asynchronous motors
    are still the best choice.)


    Yeah.

    Large brushed DC motors are not usually seen much IME.


    Have encountered brushed DC motors up to around 1/2 or 3/4 HP, not sure
    if they go much larger.

    The larger ones tend to run at lower RPMs, say (IIRC):
    1/4 HP: 20000 RPM (roughly 1.25" OD x 2" L)
    1/2 HP: 10000 RPM (roughly 2.5" OD x 4" L)
    3/4 HP: 6000 RPM (roughly 4" OD x 6" L)

    They are also typically physically larger (though a 3/4 HP brushed DC
    motor is merely the size of a 1/4 HP AC induction motor): very large by
    DC motor standards, but by AC motor standards smaller than the motors
    typically used to spin the fan blades on an air conditioner unit.

    Whereas, a 3/4 HP induction motor is a much bigger beast.


    BLDC motors are typically also small, and pure BLDC motors are often
    very expensive much over 1/4 HP (often because they use neodymium
    magnets).

    The other option is reluctance motors, though these may or may not be
    passed off as BLDC.

    One can sort of tell the difference by spinning them with no power
    applied: true BLDCs will have a high "cogging torque" (almost more like
    a stepper motor, but not as strong and with much bigger steps);
    if there is a very weak cogging torque, it is likely one of the
    intermediate reluctance/BLDC hybrids (e.g., with a ceramic ring magnet);
    if it spins freely (no cogging torque), it is likely a reluctance motor.





    Or, 2-stage, say:
        960V -> 192V (with 960V to each rack).
        192V ->  12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,

    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.

    I'm more used to thyristors in that role.


    It's better, perhaps, to refer to "semiconductor switches" as a more
    general term.

    Thyristors are mostly outdated, and are only used now in very high power situations.  Even then, they are not your granddad's thyristors, but
    have more control for switching off as well as switching on - perhaps
    even using light for the switching rather than electrical signals.
    (Those are particularly nice for megavolt DC lines.)

    You can happily switch multiple MW of power with a single IGBT module
    for a couple of thousand dollars.  Or you can use SiC FETs for up to a
    few hundred kW but with much faster PWM frequencies and thus better
    control.


    Yes.

    For medium power, typically MOSFETs were used.

    For low power, typically BJTs or Darlingtons.

    But, BJTs seem to become impractical much over around 60V 5A or so. Even
    this requires a pretty aggressive heat-sink and/or active cooling.


    MOSFETs handle more power with less heat, and are often available up to
    around 1000V 50A or so (in TO-247 packaging or similar), but can be run
    in parallel as needed for more amps.
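
    As a rough illustration of why paralleling helps (made-up numbers,
    conduction loss only; switching losses and derating ignored): loss
    scales as I^2 * Rds(on), so splitting the current across N devices cuts
    the total dissipation by N.

      #include <stdio.h>

      /* Total conduction loss for N identical MOSFETs sharing the load current. */
      static double conduction_loss_w(double total_a, double rds_on_ohm, int n)
      {
          double i_each = total_a / n;
          return n * i_each * i_each * rds_on_ohm;
      }

      int main(void)
      {
          printf("1 FET : %.1f W\n", conduction_loss_w(50.0, 0.020, 1));  /* 50.0 W */
          printf("2 FETs: %.1f W\n", conduction_loss_w(50.0, 0.020, 2));  /* 25.0 W */
          printf("4 FETs: %.1f W\n", conduction_loss_w(50.0, 0.020, 4));  /* 12.5 W */
          return 0;
      }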

    IGBTs for when one needs something big...


    Never really messed with Thyristors.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Sep 28 12:00:56 2025
    From Newsgroup: comp.arch

    On Fri, 26 Sep 2025 14:28:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    BGB <cr88192@gmail.com> writes:
    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:



    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial >applications IIRC.

    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.


    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

    I looked at PSUs offered by Dell for their rack servers. There are four
    options for the inputs, although not every server model has all four.
    The options are:
    - 100-240 VAC.
    - 200-240 VAC
    - -48 VDC
    - 240-400 VDC

    I don't know in which countries and in which branch of IT they prefer
    the fourth option, but knowing Dell of late (as opposed to Dell of up to ~2008), they would not offer the option unless demand was quite
    significant.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Sep 28 16:44:02 2025
    From Newsgroup: comp.arch

    On 27/09/2025 21:23, BGB wrote:
    On 9/27/2025 6:52 AM, David Brown wrote:
    On 27/09/2025 10:14, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial >>>>>> applications IIRC.

    I've never encountered that voltage.  Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were the most widespread motors by far up to 25-30 years ago. But my
    impression was that today various types of electric motors (DC, esp.
    brushless, AC sync, AC async) enjoy similar popularity.

    I can only speak from personal experience about the industry I
    work in (chemical).  People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed.  There are no DC networks in chemical plants.

    If you have high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    These are not "DC motors" in the traditional sense, like brushed DC
    motors.  The motors you use in a car have (roughly) sine wave drive
    signals, generally 3 phases (but sometimes more).  Even motors
    referred to as "Brushless DC motors" - "BLDC" - use AC inputs, though
    the waveforms are more trapezoidal than sinusoidal.


    Yes.

    Typically one needs to generate a 3-phase waveform at the speed they
    want to spin the motor at.


    Details of motor drives are perhaps getting a bit OT for this group - but
    there are people here interested in all sorts of things. If you want to
    have more discussions on motor drives, comp.arch.embedded might be a
    nice place for a new thread - the group appears fairly empty, but
    experts crawl out of the woodwork whenever an interesting new thread is started!



    I had noted in some experience when writing some code to spin motors (typically on an MSP430, mostly experimentally) or similar:

    Experiments are always good, but it is also helpful to combine them with
    a bit of theory so that you don't generalise too much from a small
    number of tests. In particular, the motor windings in a three phase AC
    motor can be done in several different ways, optimised for different
    kinds of controlling waves. The two main ones for small and medium
    permanent magnet motors are for sinusoidal waves (aiming for smoothest
    and most controlled driving - often called "PMSM - permanent magnet synchronous motors") and for trapezoidal driving (for simpler driving,
    often referred to as "BLDC - Brushless DC").

    Then there are different ways to track the position of the motor. You
    can have hall effect sensors, which are simple and cheap, giving 6
    positions per electrical rotation (motors can have multiple sets of
    windings and magnets, giving two or more electrical rotations per
    mechanical rotation). These are good for trapezoidal BLDC control. It
    is also possible to use sensorless control, where the hall effect
    signals are calculated by measuring the back EMF from the motor windings during the off periods of the driver half bridges. This avoids the
    sensors and can make cabling easier, but can't be used at low speed - it
    is only suitable for continuously running motors rather than positioning motors.

    Or you can have encoders, which give the more precise position needed
    for sine wave or PMSM waves. These are usually quadrature encoders,
    which are accurate and reliable but need to pass through an index
    position to get their absolute position. Sometimes absolute encoders
    are used - these are either cheaper but less precise using analogue hall effect sensors, or much more expensive using multiple Gray code rings
    with optical or inductive sensing.
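
    A minimal sketch of incremental quadrature decoding (the textbook
    lookup-table approach; the sign convention depends on wiring, and
    index-pulse handling for absolute position is left out):

      #include <stdint.h>

      /* The two encoder channels (A, B) form a 2-bit Gray-code sequence;
         comparing the previous and current state gives the direction of each
         transition.  Invalid double transitions contribute 0. */
      static const int8_t quad_delta[16] = {
          /* index = (prev_state << 2) | new_state, state = (A << 1) | B */
           0, +1, -1,  0,
          -1,  0,  0, +1,
          +1,  0,  0, -1,
           0, -1, +1,  0,
      };

      static uint8_t prev_state;
      static int32_t position;

      void quadrature_update(uint8_t a, uint8_t b)
      {
          uint8_t state = (uint8_t)((a << 1) | b);
          position += quad_delta[(prev_state << 2) | state];
          prev_state = state;
      }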

    For trapezoidal drives, you usually have a simple 6-step switching
    sequence, with each of the three half-bridges driving high for 2 steps,
    off for 1 step, low for 2 steps, off for one step. You can control the
    speed of the motor by the speed of the steps, and the power by using PWM modulation when driving high or low (or by using a single PWM control
    for the common DC bus voltage).
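
    A sketch of that 6-step sequence in table form (the hall-to-step mapping
    is a hypothetical example - the real mapping depends on the particular
    motor and sensor placement):

      #include <stdint.h>

      typedef enum { PH_OFF, PH_HIGH, PH_LOW } phase_t;
      typedef struct { phase_t a, b, c; } commutation_t;

      /* Each half-bridge: high for 2 steps, off for 1, low for 2, off for 1. */
      static const commutation_t steps[6] = {
          { PH_HIGH, PH_LOW,  PH_OFF  },
          { PH_HIGH, PH_OFF,  PH_LOW  },
          { PH_OFF,  PH_HIGH, PH_LOW  },
          { PH_LOW,  PH_HIGH, PH_OFF  },
          { PH_LOW,  PH_OFF,  PH_HIGH },
          { PH_OFF,  PH_LOW,  PH_HIGH },
      };

      /* 3-bit hall state (1..6 valid; 0 and 7 indicate a sensor fault). */
      static const int8_t hall_to_step[8] = { -1, 0, 2, 1, 4, 5, 3, -1 };

      commutation_t commutate(uint8_t hall_state)
      {
          int8_t step = hall_to_step[hall_state & 7];
          if (step < 0) {
              commutation_t all_off = { PH_OFF, PH_OFF, PH_OFF };  /* disable on fault */
              return all_off;
          }
          return steps[step];
      }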

    For sine wave driving, you need fast PWM for each of the three half
    bridges to generate three sine waves at 120° phase differences. The PWM frequency has to be high enough so that after the filtering in the motor windings, you have little in the way of harmonics.

    Generally, however, instead of actively producing sine waves, you do
    what is known as "vector control" - you measure the currents in the
    three branches, and use the angle data to convert these to currents perpendicular to and aligned to the motor position. You then regulate
    the PWM values to control these two currents - aiming to get the desired current in the active direction, and zero current perpendicular to it
    (since that is just wasted effort). The resulting waveforms are
    somewhat distorted sine waves.
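
    The core of that measurement step, as a sketch (standard Clarke/Park
    transforms; the surrounding PI regulators, and the naming, are
    assumptions rather than anything from this thread):

      #include <math.h>

      typedef struct { float d, q; } dq_t;

      /* Project measured phase currents onto the rotor frame at electrical
         angle theta.  Conventionally i_d (flux-aligned) is regulated to zero
         and i_q (perpendicular, torque-producing) to the torque demand. */
      dq_t clarke_park(float ia, float ib, float theta)
      {
          /* Clarke (amplitude-invariant), assuming ia + ib + ic == 0. */
          float i_alpha = ia;
          float i_beta  = (ia + 2.0f * ib) / sqrtf(3.0f);

          /* Park: rotate into the rotor reference frame. */
          float s = sinf(theta), c = cosf(theta);
          dq_t out = {  i_alpha * c + i_beta * s,
                       -i_alpha * s + i_beta * c };
          return out;
      }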


    An MSP430 is fine for trapezoidal control and hall effect sensors, but a
    bit underpowered for serious sine wave or vector control. You are better
    off with a Cortex-M4 for motors.


      Sine waves give low noise, but less power;

    Sine waves are closer to the ideal for many motors, but you'll get even
    lower noise with good vector control.

    You can also try adding some third harmonic - use sin(x) + 1/9 sin(3x).
    The third harmonic disappears in the motor, since it affects all three
    phases equally. But it flattens out the peaks of the sine wave and lets
    you then increase the amplitude by about 12.5% before hitting 100% of
    your DC bus voltage.
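
    A small sketch of that trick as three PWM duty cycles (the 9/8 scaling
    follows from the 8/9 peak of sin(x) + sin(3x)/9; the rest is
    illustrative):

      #include <math.h>

      #ifndef M_PI
      #define M_PI 3.14159265358979323846
      #endif

      /* Three duty cycles (0..1) at 120 degree offsets with third-harmonic
         injection; the injected sin(3x)/9 term is identical in all three
         phases, so it cancels in the line-to-line voltages. */
      void three_phase_duties(double theta, double amplitude, double duty[3])
      {
          for (int ph = 0; ph < 3; ph++) {
              double x = theta - ph * (2.0 * M_PI / 3.0);
              double v = sin(x) + sin(3.0 * x) / 9.0;
              v *= (9.0 / 8.0) * amplitude;        /* peak hits +/-1 at amplitude = 1 */
              duty[ph] = 0.5 + 0.5 * v;            /* centred on 50% duty */
          }
      }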


      Square waves are noisier and only work well at low RPM,
        but have higher torque.

    Square waves are a really bad idea - you jump between high torque and
    low torque, and will regularly be pulling the motor back a bit rather
    than forwards. Prefer trapezoidal control - it is just as easy, and
    works vastly better. You of course get more torque ripple than with
    sine waves or vector control.

      Sawtooth waves seem to work well at higher RPMs.
        Well, sorta, more like sawtooth with alternating sign.

    Do you mean trapezoidal control?

      Square-Root Sine: Intermediate between sine and square.
        Gives torque more like a square wave, but quieter.

    That's just weird. I think what you are seeing is something similar to
    the shape you get from vector control.

        Trapezoid waves are similar to this, but more noise.

    Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
    the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
    and thus generate a lot of heat otherwise).


    Of course you have lower average voltage at lower speeds and torques -
    that's why you use PWM to control your wave amplitudes.



    And whenever you have a frequency inverter, its input is first rectified
    to DC, then new AC waveforms are generated using PWM-controlled
    semiconductor switches.


    Yes:
      Dual-phase: may use a "Dual H-Bridge" configuration
        Where, the H-bridge is built using power transistors;
      Three-phase: "Triple Half-Bridge"
        Needs fewer transistors than dual phase.

    It is slightly easier to build these drivers with BJTs or Darlington
    transistors; these tend to handle less power and generate more heat, but
    are more fault tolerant.


    MOSFETs can handle more power, but one needs to be very careful not to exceed the Gate-Source voltage limit, otherwise they are insta-dead (and will behave as if they are shorted).


    FETs are always used (until voltage or current requirements force you to
    use IGBTs) - no one uses BJTs or Darlingtons in real motor control. You
    need an appropriate gate driver for the FETs, but those are common and
    cheap, and usually combined with deadtime control to avoid accidental shoot-through when you enable the high side and low side together.
    Modules that combine all this with three half-bridges are also small and cheap.


    --- Synchronet 3.21a-Linux NewsLink 1.2