Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.
Sounds like [multiscalar processors](doi:multiscalar processor)^^^^^^^^^^^^^^^^^^^^^
[ I guess it can be useful to actually look at what one pastes before
pressing "send", eh? ]
When I saw a post about a new way to do OoO, I had thought it might be talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.
This is a sound idea, but one may not find enough opportunities to use it.
Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.
John Savard
On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:
Although it's called "inverse hyperthreading", this technique could be
combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.
On further reflection, this may be equivalent to re-inventing out-of-order execution.
John Savard
John Savard <quadibloc@invalid.invalid> posted:
When I saw a post about a new way to do OoO, I had thought it might be
talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by splitting
programs into chunks that can be performed in parallel on different cores,
where the cores are intimately connected in order to make this work.
This is a sound idea, but one may not find enough opportunities to use it.
Although it's called "inverse hyperthreading", this technique could be
combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.
Andy Glew was working on stuff like this 10-15 years ago
MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
When I saw a post about a new way to do OoO, I had thought it
might be talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by
splitting programs into chunks that can be performed in parallel
on different cores, where the cores are intimately connected in
order to make this work.
This is a sound idea, but one may not find enough opportunities to
use it.
Although it's called "inverse hyperthreading", this technique
could be combined with SMT - put the chunks into different threads
on the same core, rather than on different cores, and then one
wouldn't need to add extra connections between cores to make it
work.
Andy Glew was working on stuff like this 10-15 years ago
That's what immediately came to my mind as well; it looks a lot like
trying some of his ideas about scouting micro-threads, doing work in
the hope that it will turn out useful.
To me it sounds like it is related to eager execution, except
skipping further forward into upcoming code.
Terje
The question is: what is the most likely meaning of the fact of patenting?
IMHO, it means that they explored the idea and decided against going in
this particular direction in the near and medium-term future.
I think that when Intel actually plans to use a particular idea, they
keep the idea secret for as long as they can and either don't patent it at
all or apply for a patent after the release of the product.
I could be wrong about that.
A year ago, some of them gave representations
about the advantages of removing SMT.
Removal of SMT and this super-core
idea can be considered complementary - both push in the direction of
cores with a smaller # of EU pipes.
Anyway, a couple of months ago Tan himself said that Intel is reversing
the decision to remove SMT.
On 9/15/2025 6:54 PM, John Savard wrote:
When I saw a post about a new way to do OoO, I had thought it might be talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.
This is a sound idea, but one may not find enough opportunities to use it.
Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.
Say, more cores and less power use, at the possible expense of some
amount of performance.
...
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an
x86 chip by running *everything* in a firmware level emulator via
dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
See "The Anatomy of a High-Performance Microprocessor: A Systems
Perspective" by Bruce Shriver and Bennett Smith.
For a later perspective, see
https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
According to BGB <cr88192@gmail.com>:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
John Levine <johnl@taugh.com> writes:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
It definitely was. However, even on modern high-performance OoO cores
like Apple's M1-M4 P-cores or Qualcomm's Oryon, the performance of dynamically translated AMD64 code is usually lower than on comparable
CPUs from Intel and AMD.
- anton
BGB <cr88192@gmail.com> writes:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an x86
chip by running *everything* in a firmware level emulator via
dynamic translation.
Intel has already done so, although AFAIK not at the firmware level:
Every IA-64 CPU starting with the Itanium II did not implement IA-32
in hardware (unlike the Itanium), but instead used dynamic
translation.
There is no reason for Intel to repeat this mistake, or for anyone
else to go there, either.
- anton
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at least
this nonsense has been corrected often enough. E.g., I wrote in
<2015Dec6.152525@mips.complang.tuwien.ac.at>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
Another difference is that the OoO engine that sees the uOps performs
only a very small part of the functionality of branches, with the
majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps, at the most it confirms the branch prediction, or diagnoses a misprediction, at which point the OoO
engine is out of a job and has to wait for the front end; possibly
only the ROB (which deals with instructions again) resolves the
misprediction and kicks the front end into action, however.
As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in addition to the Rops. It's not entirely clear which parts of the
engine see MacroOps and ROPs, but my impression was that the MacroOps
are not split into ROPs for the largest part of the OoO engine.
See "The Anatomy of a High-Performance Microprocessor: A Systems
Perspective" by Bruce Shriver and Bennett Smith.
From 1998. Unfortunately, there are not many more recent books about
the microarchitecture of OoO CPUs. What I have found:
Modern Processor Design: Fundamentals of Superscalar Processors
John Paul Shen, Mikko H. Lipasti
McGraw-Hill
656 pages
published 2004 or so (don't let the 2013 date from the reprint fool you)
Discusses CPU design (not just OoO) using various real CPUs from the
1990s as examples.
Processor Microarchitecture -- An Implementation Perspective
Antonio Gonzalez , Fernando Latorre , Grigorios Magklis
Springer
published 2010
Relatively short, discusses the various parts of an OoO CPU and how to implement them.
Henry Wong
A Superscalar Out-of-Order x86 Soft Processor for FPGA
Ph.D. thesis, U. Toronto, published 2017
https://www.stuffedcow.net/files/henry-thesis-phd.pdf
Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
A problem is that the older books don't cover recent developments such
as alias prediction and that Wong was limited by what a single person
can do (his work was not part of a larger research project at
U. Toronto), as well as what fits into an FPGA.
BTW, Wong's work can be seen as a refutation of BGB's statement: He
chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
states "It’s easy to implement!".
- anton
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an
x86 chip by running *everything* in a firmware level emulator via
dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at least
this nonsense has been corrected often enough. E.g., I wrote
in <2015Dec6.152525@mips.complang.tuwien.ac.at>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to - Intel calls its internal instructions micro-operations or uOps, and AMD calls its Rops.
On Thu, 18 Sep 2025 12:33:44 -0400
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an
x86 chip by running *everything* in a firmware level emulator via
dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at
least this nonsense has been corrected often enough. E.g., I wrote
in <2015Dec6.152525@mips.complang.tuwien.ac.at>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to - Intel calls its internal
instructions micro-operations or uOps, and AMD calls its Rops.
No, they don't. They stopped using the term Rops almost 25 years ago.
If they used it in early K7 manuals then it was due to inertia (K6
manuals copy&pasted without much thought given) and partly because
of marketing, because RISC was considered cool.
Thomas Koenig <tkoenig@netcologne.de> posted:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
For AMD, that has happend already a few decades ago; they translate
x86 code into RISC-like microops.
With a very loose definition of RISC::
a) Does a RISC ISA contain memory reference address generation from
the pattern [Rbase+Rindex<<scale+Displacement] ??
Some will argue yes, others no.
b) does a RISC ISA contain memory reference instructions that are
combined with arithmetic calculations ??
Some will argue yes, others no.
c) does a RISC ISA contain memory reference instructions that
access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
Most would argue no.
Yet, this is the µISA of K7 and K8. It is only RISC in the very
loosest sense of the word.
And do not get me started on the trap/exception/interrupt model.
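To make the addressing-mode and LD-OP-ST points above concrete, here is a small C illustration of my own (the function name and the asm in the comments are mine, meant only as a plausible encoding, not a claim about any particular compiler's output or AMD's internals):

/* Sketch: one source-level update seen through both styles. */
#include <stdint.h>

void bump(int64_t *a, long i, int64_t x)
{
    /* x86-64 can encode this as a single read-modify-write instruction
     * using the [base + index*scale + displacement] form, e.g.
     *     add QWORD PTR [rdi + rsi*8], rdx
     * It touches memory twice (load and store) but the TLB only once,
     * which is point (c) above.
     * A load/store RISC needs roughly:
     *     load  r3, [r1 + r2<<3]    ; if scaled indexing exists at all
     *     add   r3, r3, r4
     *     store r3, [r1 + r2<<3]
     * Whether an internal K7/K8 op carrying the whole LD-OP-ST is still
     * "RISC-like" is exactly the point in dispute. */
    a[i] += x;
}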
See "The Anatomy of a High-Performance Microprocessor: A Systems
Perspective" by Bruce Shriver and Bennett Smith.
For a later perspective, see
https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
Michael S wrote:
On Thu, 18 Sep 2025 12:33:44 -0400
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an
x86 chip by running *everything* in a firmware level emulator
via dynamic translation.
For AMD, that has happened already a few decades ago; they
translate x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at
least this nonsense has been corrected often enough. E.g., I
wrote in <2015Dec6.152525@mips.complang.tuwien.ac.at>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to - Intel calls its internal
instructions micro-operations or uOps, and AMD calls its Rops.
No, they don't. They stopped using term Rops almost 25 years ago.
If they used it in early K7 manuals then it was due to inertia (K6
manuals copy&pasted without much of thought given) and partly
because of marketing, because RISC was considered cool.
And the fact that all the RISC processors ran rings around the CISC
ones.
So they wanted to promote that "hey, we can go fast too!"
Ok, AMD dropped the "risc" prefix 25 years ago.
That didn't change the way it works internally.
They still use the term "micro op" in the Intel and AMD Optimization
guides. It still means a micro-architecture-specific internal,
simple, discrete unit of execution, albeit a more complex one as
transistor budgets allow.
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <cr88192@gmail.com>:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
Might be different now:
25 years ago, Moore's law was still going strong, and the general
concern was more about maximizing scalar performance rather than energy
efficiency or core count (and, in those days, processors were generally
single-core).
Now we have a different situation:
Moore's law is dying off;
Scalar CPU performance has hit a plateau;
And, for many uses, performance is "good enough";
A lot more software can make use of multi-threading;
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
So, one possibility could be, rather than a small number of big/fast
cores (either VLIW or OoO), possibly a larger number of smaller cores.
The cores could maybe be LIW or in-order RISC.
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
recent proposals for indexed load/store and auto-increment popping up,
I have thought about why the idea of more smaller cores has not been
more successful, at least for the kinds of loads where you have a
large number of independent and individually not particularly
demanding threads, as in web shops. My explanation is that you need
1) memory bandwidth and 2) interconnection with the rest of the
system.
The interconnection with the rest of the system probably does
not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
when they increased the cores in their server chips).
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at least
this nonsense has been corrected often enough. E.g., I wrote in
<2015Dec6.152525@mips.complang.tuwien.ac.at>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to
The number of bits has nothing to do with what it is called.
If this uOp was for a ROB style design where all the knowledge about
each instruction including register ids, immediate data,
scheduling info, result data, status, is stored in a single ROB entry,
then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.
Another difference is that the OoO engine that sees the uOps performs
only a very small part of the functionality of branches, with the
majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps, at the most it confirms the branch
prediction, or diagnoses a misprediction, at which point the OoO
engine is out of a job and has to wait for the front end; possibly
only the ROB (which deals with instructions again) resolves the
misprediction and kicks the front end into action, however.
And a uOp triggers that action sequence.
I don't see the distinction you are trying to make.
It's not entirely clear which parts of the
engine see MacroOps and ROPs, but my impression was that the MacroOps
are not split into ROPs for the largest part of the OoO engine.
AMD explains their terminology here, but note that the relationship
between Macro-Ops and Micro-Ops is micro-architecture specific.
A Seventh-Generation x86 Microprocessor, 1999
https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf
"An [micro-]OP is the minimum executable entity understood by the machine."
A macro-op is a bundle of 1 to 3 micro-ops.
Simple instructions map to 1 macro and 1-3 micro ops
and this mapping is done in the decoder.
Complex instructions map to one or more "micro-lines" each of which
consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.
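As a rough data-model sketch of that decoder mapping (the type and field names below are invented for illustration; they are not AMD's):

/* Invented sketch of the macro-op / micro-op relationship described above. */
#include <stdint.h>

enum uop_kind { UOP_LOAD, UOP_ALU, UOP_STORE };

struct micro_op {              /* "minimum executable entity understood by the machine" */
    enum uop_kind kind;
    uint8_t src1, src2, dst;   /* register ids, purely illustrative */
};

struct macro_op {              /* a bundle of 1 to 3 micro-ops */
    struct micro_op uop[3];
    uint8_t n_uops;            /* 1..3, filled in by the decoder */
};

struct micro_line {            /* complex instructions: one or more of these */
    struct macro_op mop[3];    /* 3 macro-ops per line, pulled from microcode ROM */
};

int main(void)
{
    /* e.g. an "add [mem], reg" as one macro-op of three micro-ops: */
    struct macro_op rmw = {
        .uop = { { UOP_LOAD, 1, 0, 9 }, { UOP_ALU, 9, 2, 9 }, { UOP_STORE, 9, 1, 0 } },
        .n_uops = 3
    };
    return rmw.n_uops == 3 ? 0 : 1;
}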
This is a bit introductory level:
Book
Computer Organization and Design
The Hardware/Software Interface: RISC-V Edition, 2018
Patterson, Hennessy
EricP <ThatWouldBeTelling@thevillage.com> writes:
-------------------------------
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Yes, so much is clear. It's not clear where Macro-Ops are in play and
where Micro-Ops are in play. Over time I get the impression that the macro-ops are the main thing running through the OoO engine, and
Micro-Ops are only used in specific places, but it's completely
unclear to me where. E.g., if they let an RMW Macro-Op run through
the OoO engine, it would first go to the LSU for the address
generation, translation and load, then to the ALU for the
modification, then to the LSU for the store, and then to the ROB.
Where in this whole process is a Micro-Op actually stored?
- anton
BGB <cr88192@gmail.com> writes:
--------------------------------------
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <cr88192@gmail.com>:
I have thought about why the idea of more smaller cores has not been
more successful, at least for the kinds of loads where you have a
large number of independent and individually not particularly
demanding threads, as in web shops. My explanation is that you need
1) memory bandwidth and 2) interconnection with the rest of the
system.
The interconnection with the rest of the system probably does
not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
when they increased the cores in their server chips).
The bandwidth requirements to main memory for given cache sizes per
core reduce linearly with the performance of the cores; if the larger
number of smaller cores really leads to increased aggregate
performance, additional main memory bandwidth is needed, or you can compensate for that with larger caches.
But to eliminate some variables, let's just consider the case where we
want to get the same performance with the same main memory bandwidth
from using more smaller cores than we use now. Will the resulting CPU require less area? The cache sizes per core are not reduced, and
their area is not reduced much.
The core itself will get smaller, and
its performance will also get smaller (although by less than the
core).
But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
the per-core performance, so for a given amount of total performance,
the area goes up.
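A toy calculation, with numbers made up purely for illustration (none of them are measurements), may make the shape of that area-per-performance argument concrete:

/* Toy numbers for the argument above: keep cache area per core fixed,
 * shrink the core, and see when area for equal total throughput rises. */
#include <stdio.h>

int main(void)
{
    double cache_area = 3.0;                        /* mm^2 per core, held constant */
    double big_core   = 4.0, big_perf   = 1.0;      /* big OoO core */
    double small_core = 1.0, small_perf = 0.5;      /* core shrinks 4x, perf only 2x */

    printf("area per unit of performance: big %.1f mm^2, small %.1f mm^2\n",
           (big_core + cache_area) / big_perf,      /* 7.0 */
           (small_core + cache_area) / small_perf); /* 8.0: more area for equal total perf */
    return 0;
}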
There is one counterargument to these considerations: The largest configuration of Turin dense has less cache for more cores than the
largest configuration of Turin. I expect that's the reason why they
offer both; if you have less memory-intensive loads, Turin dense with
the additional cores will give you more performance, otherwise you
better buy Turin.
Also, Intel has added 16 E-Cores to their desktop chips without giving
them the same amount of caches as the P-Cores; e.g., in Arrow lake we
have
P-core: 48KB D-L0, 64KB I-L1, 192KB D-L1, 3MB L2, 3MB L3/core
E-core: 32KB D-L1, 64KB I-L1, 4MB L2/4 cores, 3MB L3/4 cores
Here we don't have an alternative with more P-Cores and the same
bandwidth, so we cannot contrast the approaches. But it's certainly
the case that if you have a bandwidth-hungry load, you don't need to
buy the Arrow Lake with the largest number of E-Cores.
- anton
BGB <cr88192@gmail.com> writes:
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
And most Intel and AMD chips don't have a 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
No TDP numbers are given for Oryon. For Apple's M4, the numbers are
M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W
Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.
[RISC-V]
recent proposals for indexed load/store and auto-increment popping up,
Where can I read about that?
- anton
BGB <cr88192@gmail.com> writes:
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <cr88192@gmail.com>:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
Might be different now:
25 years ago, Moore's law was still going strong, and the general
concern was more about maximizing scalar performance rather than energy
efficiency or core count (and, in those days, processors were generally
single-core).
IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
would be a good idea nowadays, it would not have been canceled.
How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
IA-64 hardware, how would that inefficiency be eliminated in
multi-threaded programs?
Now we have a different situation:
Moore's law is dying off;
Even if that is the case, how should that change anything about the
relative merits of the two approaches?
Scalar CPU performance has hit a plateau;
True, but again, what's the relevance for the discussion at hand?
And, for many uses, performance is "good enough";
In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.
A lot more software can make use of multi-threading;
Possible, but how would it change things?
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
Evidence?
On 9/19/2025 4:50 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <cr88192@gmail.com>:
Still sometimes it seems like it is only a matter of time until
Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
Might be different now:
25 years ago, Moore's law was still going strong, and the general
concern was more about maximizing scalar performance rather than energy
efficiency or core count (and, in those days, processors were generally
single-core).
IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
would be a good idea nowadays, it would not have been canceled.
How should the number of cores change anything? If you cannot make
single-threaded IA-32 or AMD64 programs run at competitive speeds on
IA-64 hardware, how would that inefficiency be eliminated in
multi-threaded programs?
Now we have a different situation:
  Moore's law is dying off;
Even if that is the case, how should that change anything about the
relative merits of the two approaches?
  Scalar CPU performance has hit a plateau;
True, but again, what's the relevance for the discussion at hand?
  And, for many uses, performance is "good enough";
In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.
Possibly, it depends.
The question is what could Intel or AMD do if the wind blew in that direction.
For the end-user, the experience is likely to look similar, so they
might not need to know/care if they are using some lower-power native
chip, or something that is internally running on a dynamic translator to some likely highly specialized ISA.
  A lot more software can make use of multi-threading;
Possible, but how would it change things?
Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
Evidence?
No hard numbers, but experience here:
ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
ARM11 cores).
The RasPi basically runs circles around the Eee...
Though, no good datapoints for fast x86 emulators here.
 At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.
( no time right now, so skipping rest )
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at least
this nonsense has been corrected often enough. E.g., I wrote in
<2015Dec6.152525@mips.complang.tuwien.ac.at>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to
I am objecting to the claim that uops are RISC-like, and that there is
a translation to RISC occurring inside the CPU, and (not present here,
but often also claimed) that therefore there is no longer a difference between RISC and non-RISC.
One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
RISC architecture is an architecture.
The number of bits has nothing to do with what it is called.
If this uOp was for a ROB style design where all the knowledge about
each instruction including register ids, immediate data,
scheduling info, result data, status, is stored in a single ROB entry,
then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.
Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
more importantly valued reservation stations, and yes, the 118 or
whatever bits include the operands. I have no idea how the P6 handles
its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
(but I think it has a unified scheduler, so that would not work out,
or maybe I miss something).
But concerning the discussion at hand: Containing the data is a
significant deviation from RISC instruction sets, and RISC
instructions are typically only 32 bits or 16 bits wide.
Another difference is that the OoO engine that sees the uOps performs
only a very small part of the functionality of branches, with the
majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps, at the most it confirms the branch
prediction, or diagnoses a misprediction, at which point the OoO
engine is out of a job and has to wait for the front end; possibly
only the ROB (which deals with instructions again) resolves the
misprediction and kicks the front end into action, however.
And a uOp triggers that action sequence.
I don't see the distinction you are trying to make.
The major point is that the OoO engine (the part that deals with uops)
sees a linear sequence of uops it has to process, with nearly all
actual branch processing (which an architecture has to do) done in a
part that does not deal with uops. With the advent of uop caches that
has changed a bit, but many of the CPUs for which the uop=RISC claim
has been made do not have an uop cache.
On Fri, 19 Sep 2025 09:50:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
The interconnection with the rest of the system probably does
not get much cheaper for the smaller cores, and probably becomes more
expensive with more cores (e.g., Intel switched from a ring to a grid
when they increased the cores in their server chips).
That particular problem is addressed by grouping smaller cores into
clusters with a shared L2 cache. It's especially effective for scaling
when the L2 cache is truly inclusive relative to the underlying L1 caches.
The price is limited L2 bandwidth as seen by the cores.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
EricP <ThatWouldBeTelling@thevillage.com> writes:
-------------------------------
Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
BGB <cr88192@gmail.com> schrieb:
Yes, so much is clear. It's not clear where Macro-Ops are in play and
where Micro-Ops are in play. Over time I get the impression that the
macro-ops are the main thing running through the OoO engine, and
Micro-Ops are only used in specific places, but it's completely
unclear to me where. E.g., if they let an RMW Macro-Op run through
the OoO engine, it would first go to the LSU for the address
generation, translation and load, then to the ALU for the
modification, then to the LSU for the store, and then to the ROB.
Where in this whole process is a Micro-Op actually stored?
In the reservation station.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <cr88192@gmail.com>:
--------------------------------------
I have thought about why the idea of more smaller cores has not been
more successful, at least for the kinds of loads where you have a
large number of independent and individually not particularly
demanding threads, as in web shops. My explanation is that you need
1) memory bandwidth and 2) interconnection with the rest of the
system.
Yes, exactly:: if you have a large number of cores doing a performance of
X, they will need exactly the same memory BW as a smaller number of cores
also performing at X.
Sooner or later, you actually have to read/write main memory.
But to eliminate some variables, let's just consider the case where we
want to get the same performance with the same main memory bandwidth
from using more smaller cores than we use now. Will the resulting CPU
require less area? The cache sizes per core are not reduced, and
their area is not reduced much.
A core running at ½ the performance can use a cache that is ¼ the size
and see the same percentage degradation WRT cache misses (as long as
main memory is equally latent). TLBs too.
12× smaller and 12× lower power for 1/2 the performance.
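A quick back-of-envelope check of the 1/2-performance, 1/4-cache claim, assuming the common square-root rule of thumb (miss rate roughly proportional to 1/sqrt(cache size)); the miss rate and penalty below are invented, only the ratios matter:

/* Same percentage degradation from misses with a half-speed core and a
 * quarter-size cache, under the sqrt rule of thumb. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double miss_big = 0.02, penalty = 50.0;            /* arbitrary units */
    double miss_small = miss_big * sqrt(4.0);          /* 1/4 the cache -> 2x the misses */

    double degr_fast = miss_big   * penalty / 1.0;     /* miss time relative to compute time */
    double degr_slow = miss_small * penalty / 2.0;     /* half-speed core: compute time doubles */

    printf("relative degradation: fast + full cache %.2f, half-speed + 1/4 cache %.2f\n",
           degr_fast, degr_slow);                      /* both 1.00 */
    return 0;
}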
On 9/19/2025 9:33 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
And most Intel and AMD chips don't have a 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently
higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.
Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...
Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
No TDP numbers are given for Oryon. For Apple's M4, the numbers are
M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W
Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.
A lot of the ARM SoC's I had seen had lower TDPs, though more often with
Cortex A53 or A55/A78 cores or similar:
Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.
[RISC-V]
recent proposals for indexed load/store and auto-increment popping up,
Where can I read about that.
For now, just on the mailing lists, eg:
https://lists.riscv.org/g/tech-arch-review/message/368
I see the difference between CISC and RISC as in the micro-architecture,
changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.
The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.
One can take an Alpha ISA and implement it with a microcoded sequencer
but that should not be called RISC
RISC changes that design to one like a multi-threaded program with
messages passing between them called uOps, where the dynamic state
of each instruction is mostly carried with the uOp message,
and each thread does something very simple and passes the uOp on.
Where global resources are required, they are temporarily dynamically
allocated to the uOp by the various threads, carried with the uOp,
and returned later when the uOp message is passed to the Retire thread.
The Retire thread is the only one which updates the visible global state.
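A very loose software sketch of that message-passing picture (the names and fields below are invented, and real hardware is of course not literally threads and queues):

/* Illustrative only: dynamic state rides with the uOp message, and only
 * the retire stage touches the globally visible state. */
#include <stdint.h>
#include <stdio.h>

struct uop_msg {            /* the dynamic state travels with the message */
    uint32_t seq;           /* program-order tag */
    uint8_t  phys_dst;      /* temporarily allocated global resource */
    uint8_t  arch_dst;      /* where retire makes the result visible */
    int64_t  result;        /* filled in by the execute "thread" */
    int      done;
};

static int64_t arch_regs[32];    /* the only globally visible state */
static int     phys_free[64];    /* free list for the borrowed resource */

/* Each stage (rename, schedule, execute, ...) would pop a uop_msg from
 * its input queue, do one simple thing, and pass it on.  Only retire
 * writes the visible state and returns the borrowed resources: */
static void retire(struct uop_msg *m)
{
    if (m->done)
        arch_regs[m->arch_dst] = m->result;
    phys_free[m->phys_dst] = 1;  /* give the physical register back */
}

int main(void)
{
    struct uop_msg m = { .seq = 1, .phys_dst = 40, .arch_dst = 3,
                         .result = 42, .done = 1 };
    retire(&m);
    printf("r3 = %lld\n", (long long)arch_regs[3]);
    return 0;
}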
The RISC design guidelines described by various papers, rather than
go/no-go decisions, are mostly engineering compromises for consideration
of things which would make an MST-MPA more expensive to implement or
otherwise interfere with maximizing the active concurrency of all threads.
This is why I think it would have been possible to build a risc-style
PDP-11 in 1975 TTL, or a VAX if they had just left the instructions of
the same complexity as PDP-11 ISA (53 opcodes, max one immediate,
max one mem op per instruction).
On 9/19/2025 9:33 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
And most Intel and AMD chips don't have a 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently
higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.
Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...
Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
No TDP numbers are given for Oryon. For Apple's M4, the numbers are
M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W
Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.
A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:
Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.
Or, for a slightly newer chip (2020):
https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j
TDP 5W, has A55 and A78 cores.
Some amount of the HiSilicon numbers look similar...
But, yeah, I guess if using these as data-points:
A55: ~ 5/8W, or ~ 0.625W (very crude)
Zen+: ~ 105/16W, ~ 6.56W
So, more like 10x here, but ...
Then, I guess it becomes a question of the relative performance
difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...
Judging based on my cellphone (with A53 cores), and previously running
my emulator in Termux, there is a performance difference, but nowhere
near 10x.
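Putting those crude per-core numbers into a perf/W comparison (the power figures are the same rough estimates as above, and the 4x slowdown is my own guess, not a measurement):

/* The A55 wins perf/W as long as its slowdown is smaller than the
 * per-core power ratio. */
#include <stdio.h>

int main(void)
{
    double w_a55 = 5.0 / 8.0, w_zen = 105.0 / 16.0;   /* per-core W, very rough */
    double power_ratio = w_zen / w_a55;               /* ~10.5x */

    double slowdown = 4.0;   /* guessed relative speed of a Zen+ core vs an A55 */
    printf("power ratio %.1fx, slowdown %.1fx -> A55 perf/W advantage %.1fx\n",
           power_ratio, slowdown, power_ratio / slowdown);
    return 0;
}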
EricP <ThatWouldBeTelling@thevillage.com> writes:
I see the difference between CISC and RISC as in the micro-architecture,
But the microarchitecture is not an architectural criterion.
changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.
People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.
The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.
The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.
And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
I see the difference between CISC and RISC as in the micro-architecture,
But the microarchitecture is not an architectural criterion.
changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.
People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.
The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.
The same holds true for the MIPS R2000, the ARM1/2 (and probably many
successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.
And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).
Maybe relevant:
Performance optimizers writing asm regularly hit that 1 IPC on the 486
and (with more difficulty) 2 IPC on the Pentium.
When we did get there, the final performance was typically 3X compiled C code.
That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
the PPro and later OoO CPUs.
Yes, organizing the interconnect in a hierarchical way can help reduce
the increase in interconnect cost, but I expect that there is a reason
why Intel did not do that for its server CPUs with P-Cores, by e.g.,
forming clusters of 4, and then continuing with the ring; instead,
they opted for a grid interconnect.
- anton
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
I see the difference between CISC and RISC as in the micro-architecture,
But the microarchitecture is not an architectural criterion.
changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.
People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.
The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.
The same holds true for the MIPS R2000, the ARM1/2 (and probably many
successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.
And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).
Maybe relevant:
Performance optimizers writing asm regularly hit that 1 IPC on the 486
and (with more difficulty) 2 IPC on the Pentium.
When we did get there, the final performance was typically 3X compiled C
code.
That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
the PPro and later OoO CPUs.
And then came back with SIMD, I presume? :-)
BGB <cr88192@gmail.com> wrote:
On 9/19/2025 9:33 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
And most Intel and AMD chips don't have a 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently
higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.
Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...
Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
No TDP numbers are given for Oryon. For Apple's M4, the numbers are
M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W
Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.
A lot of the ARM SoC's I had seen had lower TDPs, though more often with
Cortex A53 or A55/A78 cores or similar:
Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.
Or, for a slightly newer chip (2020):
https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j
TDP 5W, has A55 and A78 cores.
Some amount of the HiSilicon numbers look similar...
But, yeah, I guess if using these as data-points:
A55: ~ 5/8W, or ~ 0.625W (very crude)
Zen+: ~ 105/16W, ~ 6.56W
So, more like 10x here, but ...
Then, I guess it becomes a question of the relative performance
difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...
Judging based on my cellphone (with A53 cores), and previously running
my emulator in Termux, there is a performance difference, but nowhere
near 10x.
A single core in the Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks to 4453.45 DMIPS (Dhrystone MIPS). A single core in my desktop benchmarks to about 50000 DMIPS. Dhrystone contains string operations which benefit
from SSE/AVX, but I would expect that on a media load the speed ratio would
be even more favourable to the desktop core. On jumpy code the ratio is probably lower. The 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.
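For reference, just dividing out the DMIPS figures quoted above (nothing new here, only the ratios):

/* Ratios of the Dhrystone numbers given above. */
#include <stdio.h>

int main(void)
{
    double h618 = 4453.45, desktop = 50000.0, milkv = 1472.0;
    printf("desktop / H618  : %.1fx\n", desktop / h618);   /* ~11.2x */
    printf("desktop / Milk-V: %.1fx\n", desktop / milkv);  /* ~34.0x */
    printf("H618 / Milk-V   : %.1fx\n", h618 / milkv);     /* ~3.0x */
    return 0;
}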
It is hard to compare performance per watt: the Orange Pi Zero 3 has low
power draw (on the order of 100 mA from a 5V USB charger with one core active) and
it is not clear how it is distributed between the CPUs and the Ethernet interface. The RISC-V in the Milkv-Duo has even lower power draw. OTOH desktop cores
normally seem to run at a fraction of their rated power too (but I have
no way to directly measure CPU power draw).
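For a very rough sense of scale: 100 mA at 5 V is only about 0.5 W for the
whole board, so even charging all of it to the CPU gives on the order of
9000 DMIPS/W, versus perhaps 500-2000 DMIPS/W for a desktop core drawing a
few tens of watts - with all the caveats above about not knowing the real
CPU share.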
Of course, there is a catch: the desktop CPU is made on a more advanced
process than the small processors, so it is hard to separate the effects
of the architecture from those of the process.
BGB <cr88192@gmail.com> writes:
On 9/19/2025 4:50 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
And, for many uses, performance is "good enough";
In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.
Possibly, it depends.
The question is what could Intel or AMD do if the wind blew in that
direction.
What direction?
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC-style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
Evidence?
No hard numbers, but experience here:
ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
ARM11 cores).
The RasPi basically runs circles around the Eee...
That's probably a software problem. Different Eee PC models have
different CPUs, Celeron M @571MHz, 900MHz, or 630MHz, Atoms with 1330-1860MHz, or AMD C-50 or E350. All of them are quite a bit faster
than the 700MHz ARM11. While I don't have a Raspi1 result on https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
faster than the 700MHz ARM11), and also some CPUs similar to those
used in the Eee PC; numbers are times in seconds:
- Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
- Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89
- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
- AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
So all of these CPUs clearly beat the one in the Raspi3, which I
expect to be clearly faster than the ARM11.
Now imagine running the software that made the Eee PC so slow with
dynamic translation on a Raspi1. How slow would that be?
- anton
On 9/20/2025 8:10 AM, Waldek Hebisch wrote:
BGB <cr88192@gmail.com> wrote:
On 9/19/2025 9:33 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
And most Intel and AMD chips don't have a 150W TDP either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.
Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...
Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...
No TDP numbers are given for Oryon. For Apple's M4, the numbers are
M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W
Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.
A lot of the ARM SoCs I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:
Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.
Or, for a slightly newer chip (2020):
https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j
TDP 5W, has A55 and A78 cores.
Some amount of the HiSilicon numbers look similar...
But, yeah, I guess if using these as data-points:
A55: ~ 5/8W, or ~ 0.625W (very crude)
Zen+: ~ 105/16W, ~ 6.56W
So, more like 10x here, but ...
Then, I guess it becomes a question of the relative performance
difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...
Judging based on my cellphone (with A53 cores), and previously running
my emulator in Termux, there is a performance difference, but nowhere
near 10x.
Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks to 4453.45 DMIPS (Dhrystone MIPS). A single core in my desktop benchmarks to about 50000 DMIPS. Dhrystone contains string operations which benefit
from SSE/AVX, but I would expect that on media loads the speed ratio would
be even more favourable to the desktop core. On jumpy code the ratio is probably
lower. The 1GHz RISC-V in the Milkv-Duo benchmarks to 1472 DMIPS.
It is hard to compare performance per watt: the Orange Pi Zero 3 has low
power draw (on the order of 100 mA from a 5V USB charger with one core active) and
it is not clear how it is distributed between the CPUs and the Ethernet interface. The RISC-V in the Milkv-Duo has even lower power draw. OTOH desktop cores
normally seem to run at a fraction of their rated power too (but I have
no way to directly measure CPU power draw).
Of course, there is a catch: the desktop CPU is made on a more advanced
process than the small processors, so it is hard to separate the effects
of the architecture from those of the process.
I had noted before that when I compiled Dhrystone on my Ryzen using
MSVC, it is around 10M, or 5691 DMIPS, or around 1.53 DMIPS/MHz.
Curiously, the score is around 4x higher (around 40M) if Dhrystone is compiled with GCC (and around 2.5x with Clang).
For most other things, the performance scores seem closer.
I don't really trust GCC's and Clang's Dhrystone scores as they seem basically out-of-line with most other things I can measure.
Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
If assuming MSVC as the reference, this would imply (after normalizing
for clock-speeds) that the Ryzen only gets around 50% more IPC.
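(For anyone checking the arithmetic: 10,000,000 dhrystones/sec over the
1757 dhrystones/sec VAX 11/780 baseline gives ~5691 DMIPS, and 5691 / 3700
MHz (the 3.7 GHz mentioned earlier) is ~1.54 DMIPS/MHz; likewise 90,000 /
1757 is ~51 DMIPS and 51 / 50 MHz is ~1.02 DMIPS/MHz, so the per-clock
ratio is about 1.54 / 1.02 ~= 1.5, hence the ~50%.)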
I noted when compiling my BJX2 emulator:
My Ryzen can emulate it at roughly 70MHz;
My cell-phone can manage it at roughly 30MHz.
This isn't *that* much larger than the difference in CPU clock speeds.
It is like, I seemingly live in a world where a lot of my own benchmark attempts tend to be largely correlated with the relative difference in
clock speeds and similar.
Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
to perform somewhat below that curve.
Though, in the case of the laptop, it may be a case of not getting all
that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
speeds, and on that laptop, its memcpy speeds are kinda crap if compared with CPU clock-speed).
Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
Thing ran Quake and Quake 2 pretty OK, but not much else.
Though, if running my emulator on the laptop, it is more back on the curve of relative clock-speed, rather than on the
relative-memory-bandwidth curve.
It seems both my neural-net stuff and most of my data compression stuff, more follow the memory bandwidth curve (though, for the laptop, it seems
NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).
Well, and then my BJX2 core seems to punch slightly above its weight
class (MHz-wise) by having disproportionately high memory bandwidth.
...
On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <quadibloc@invalid.invalid> wrote:
On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:
Although it's called "inverse hyperthreading", this technique could be
combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.
On further reflection, this may be equivalent to re-inventing out-of-order execution.
John Savard
Sounds more like dynamic micro-threading.
Over the years I've seen a handful of papers about compile-time micro-threading: that is the compiler itself identifies separable
dependency chains in serial code and rewrites them into deliberate
threaded code to be executed simultaneously.
It is not easy to do under the best of circumstances and I've never
seen anything about doing it dynamically at run time.
To make a thread worth rehosting to another core, it would need to be
(at least) many 10s of instructions in length. To figure this out dynamically at run time, it seems like you'd need the decode window to
be 1000s of instructions and a LOT of "figure-it-out" circuitry.
MMV, but to me it doesn't seem worth the effort.
But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.
AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.
So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.
On 22/09/2025 17:28, Stefan Monnier wrote:
But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.
AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.
So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.
Yes, I think that is correct.
A lot of it, as far as I have read, comes down to the type of
calculation you are doing. ARM cores can often be a lot more efficient
at general integer work and other common actions, as a result of a
better designed instruction set and register set. But once you are
using slightly more specific hardware features - vector processing,
floating point, acceleration for cryptography, etc., it's all much the
same. It takes roughly the same energy to do these things regardless of
the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
on a chip.
So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go smaller - small embedded systems - x86 is totally non-existent because
an x86 microcontroller would be an order of magnitude bigger, more
expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.
David Brown <david.brown@hesbynett.no> posted:
On 22/09/2025 17:28, Stefan Monnier wrote:
But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.
AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.
So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.
Yes, I think that is correct.
A lot of it, as far as I have read, comes down to the type of
calculation you are doing. ARM cores can often be a lot more efficient
at general integer work and other common actions, as a result of a
better designed instruction set and register set. But once you are
using slightly more specific hardware features - vector processing,
floating point, acceleration for cryptography, etc., it's all much the
same. It takes roughly the same energy to do these things regardless of
the instruction set. Cache memory takes about the same power, as do PCI
interfaces, memory interfaces, and everything else that takes up power
on a chip.
So when you have a relatively small device - such as what you need for a
mobile phone - the instruction set and architecture makes a significant
difference and ARM is a lot more power-efficient than x86. (If you go
smaller - small embedded systems - x86 is totally non-existent because
an x86 microcontroller would be an order of magnitude bigger, more
expensive and power-consuming than an ARM core.) But when you have big
processors for servers, and are using a significant fraction of the
processor's computing power, the details of the core matter a lot less.
Big servers have rather equal power in the peripherals {DISKs, SSDs, and NICs} and DRAM {plus power supplies and cooling} than in the cores.
On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.
Still, CPU power often matters.
On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
David Brown <david.brown@hesbynett.no> posted:
On 22/09/2025 17:28, Stefan Monnier wrote:
But, AFAIK the ARM cores tend to use significantly less power
when emulating x86 than a typical Intel or AMD CPU, even if
slower.
AFAIK datacenters still use a lot of x86 CPUs, even though most
of them run software that's just as easily available for ARM.
And many datacenters care more about "perf per watt" than raw performance.
So, I think the difference in power consumption does not favor ARM nearly as significantly as you think.
Yes, I think that is correct.
A lot of it, as far as I have read, comes down to the type of calculation you are doing. ARM cores can often be a lot more
efficient at general integer work and other common actions, as a
result of a better designed instruction set and register set. But
once you are using slightly more specific hardware features -
vector processing, floating point, acceleration for cryptography,
etc., it's all much the same. It takes roughly the same energy to
do these things regardless of the instruction set. Cache memory
takes about the same power, as do PCI interfaces, memory
interfaces, and everything else that takes up power on a chip.
So when you have a relatively small device - such as what you need
for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of magnitude bigger, more expensive and power-consuming than an ARM
core.) But when you have big processors for servers, and are using
a significant fraction of the processor's computing power, the
details of the core matter a lot less.
Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.
Still, CPU power often matters.
Spec.org has a special benchmark for that called SPECpower_ssj2008.
It is old and Java-oriented, but I don't think that it is useless.
Right now the benchmark clearly shows that AMD offerings dominate
Intel's.
The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html
The best Intel scores are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores;
they barely beat some old EPYC3 scores from 2021. https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html
There are very few non-x86 submissions. The only one that I found in
the last 5 years used the Nvidia Grace CPU Superchip, based on Arm
Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html
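(For scale: 44168 / 25526 is roughly 1.7, so the best AMD result delivers
about 1.7x the work per watt of the best Intel one, and about 3.3x the
Grace submission - though a single non-x86 data point is not much to go on.)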
Michael S <already5chosen@yahoo.com> posted:
A quick survey of the result database indicates only Oracle is
sending results to the database.
Would be interesting to see the Apple/ARM comparisons.
On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:
On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Big servers have rather equal power in the peripherals {DISKs,
SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
the cores.
Still, CPU power often matters.
Yes ... and no.
80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.
At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).
On Wed, 24 Sep 2025 15:56:37 -0400
George Neuner <gneuner2@comcast.net> wrote:
On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:
On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Big servers have rather equal power in the peripherals {DISKs,
SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
the cores.
Still, CPU power often matters.
Yes ... and no.
80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.
I think that it's less than 80%. But it does not matter and does not
change anything - power spent on cooling is approximately
proportional to power spent on running.
At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).
Michael S <already5chosen@yahoo.com> writes:
On Wed, 24 Sep 2025 15:56:37 -0400
George Neuner <gneuner2@comcast.net> wrote:
80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.
At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).
A typical 16GB DIMM module will dissipate 3-5 watts. So 128GB (8 modules) will
draw in the vicinity of 32 watts.
On Wed, 24 Sep 2025 21:04:03 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
<...>
Scott,
When you answer George Neuner's point, can you, please, reply to George Neuner's post rather than to mine?
Once I've read an article and restarted my newsreader, I don't have
access to read articles (at least not easily).
On Thu, 25 Sep 2025 14:23:04 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Once I've read an article and restarted my newsreader, I don't have
access to read articles (at least not easily).
Doesn't it suck?
scott@slp53.sl.home (Scott Lurndal) writes:
Once I've read an article and restarted my newsreader, I don't have access to read articles (at least not easily).
I press the "Goto parent" button, and I think that already existed in xrn-9.03,
George Neuner wrote:
On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:
On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.
Still, CPU power often matters.
Yes ... and no.
80+% of the power used by datacenters is devoted to cooling the
computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).
I am quite sure that number is simply bogus: The power factors we were quoted when building the largest new datacenter in Norway 10+ years ago,
was more like 6-10% of total power for cooling afair.
. a quick google...
https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/
This one claims a 1.07 Power Usage Effectiveness.
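(PUE is total facility power divided by IT load, so 1.07 means cooling plus
power distribution add only about 7% on top of the IT load - roughly 6.5%
of the total - which is in the same ballpark as the 6-10% above, and
nowhere near 80%.)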
Terje
Terje Mathisen <terje.mathisen@tmsw.no> posted:
I am quite sure that number is simply bogus: The power factors we were
quoted when building the largest new datacenter in Norway 10+ years ago,
was more like 6-10% of total power for cooling afair.
. a quick google...
https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/
This one claims a 1.07 Power Usage Effectiveness.
All of this depends on where the "cold sink" is !! and how cold it is.
Pumping 6°C sea water through water-to-air heat exchangers is a lot
more power efficient than using FREON and dumping the heat into 37°C
air.
I still suspect that rectifying and delivering clean (low-noise) DC
to the chassis takes a lot more energy than taking the resulting heat
away.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
I am quite sure that number is simply bogus: The power factors we were
quoted when building the largest new datacenter in Norway 10+ years ago, was more like 6-10% of total power for cooling afair.
. a quick google...
https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/
This one claims a 1.07 Power Usage Effectiveness.
All of this depends on where the "cold sink" is !! and how cold it is.
Pumping 6°C sea water through water-to-air heat exchangers is a lot
more power efficient than using FREON and dumping the heat into 37°C
air.
I still suspect that rectifying and delivering clean (low-noise) DC
to the chassis takes a lot more energy than taking the resulting heat
away.
The FB article above describes how they reduced the
losses due to voltage changes as well as rectification.
Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.
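(Chaining the conventional stages multiplicatively: 0.98 x 0.97 x
(0.88..0.94) is roughly 0.84..0.89, i.e. very roughly 11-16% lost before
the board, which is the baseline the 7.5% figure is being compared against.)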
George Neuner wrote:
On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:
On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Big servers have rather equal power in the peripherals {DISKs,
SSDs, and NICs} and DRAM {plus power supplies and cooling} than
in the cores.
Still, CPU power often matters.
Yes ... and no.
80+% of the power used by datacenters is devoted to cooling the
computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).
I am quite sure that number is simply bogus: The power factors we
were quoted when building the largest new datacenter in Norway 10+
years ago, was more like 6-10% of total power for cooling afair.
.. a quick google...
https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/
This one claims a 1.07 Power Usage Effectiveness.
Terje
Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.
What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.
Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).
Where the second stage drop could use slightly cheaper transistors,
BGB <cr88192@gmail.com> schrieb:
Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.
I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.
What if, opposed to each computer using its own power-supply (from
120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.
That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.
Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).
Where the second stage drop could use slightly cheaper transistors,
Transistors?
On 9/25/2025 9:03 PM, Scott Lurndal wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.
Hmm...
Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.
What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.
In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).
On 9/26/25 7:28 AM, Scott Lurndal wrote:
In those datacenters, the UPS distributes 48VDC to the rack components
(computers, network switches, storage devices, etc).
Is it still -48V?
Historically, Bell System plant voltage, supplied by batteries.
BGB <cr88192@gmail.com> writes:
On 9/25/2025 9:03 PM, Scott Lurndal wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.
Hmm...
Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.
What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.
In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).
On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.
I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.
Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my impression is that today various types of electric motors (DC, esp. brushless, AC sync, AC async) enjoy similar popularity.
What if, opposed to each computer using its own power-supply (from
120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.
That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.
I have never been in a big datacenter, but I have heard that they prefer DC.
Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).
Where the second stage drop could use slightly cheaper transistors,
Transistors?
Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.
Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.
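(The arithmetic behind the higher-voltage point: P_loss = I^2 * R and
I = P / V, so P_loss = (P / V)^2 * R - doubling the distribution voltage
cuts the resistive loss in a given wire to a quarter for the same
delivered power, whether the feed is AC or DC.)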
BGB <cr88192@gmail.com> posted: --------------------snip----------------------------------
Higher voltage would be needed with DC vs AC, as DC is more subject to
resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.
The military routinely uses 400 Hz to reduce the weight of transformers.
Michael S <already5chosen@yahoo.com> schrieb:
On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.
I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.
Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression is that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.
I can only speak from personal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC networks in chemical plants.
If you have a high-voltage DC system (like in an electric car) then
using DC motors makes more sense.
Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).
Where the second stage drop could use slightly cheaper transistors,
Transistors?
Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.
I'm more used to thyristors in that role.
David Brown <david.brown@hesbynett.no> schrieb:
And whenever you have a frequency inverter, the input to the frequency
inverter is first rectified to DC, then new AC waveforms are generated
using PWM-controlled semiconductor switches.
If you have three phases (required for high-power industrial motors)
I believe people use the three phases directly to convert from three
phases to three phases.
The resulting waveforms are not pretty, and contribute to the
difficulty of measuring power input.
On 9/26/2025 9:28 AM, Scott Lurndal wrote:
In those datacenters, the UPS distributes 48VDC to the rack components
(computers, network switches, storage devices, etc).
48VDC also makes sense, as it is common in other contexts. I sorta
figured a higher voltage would have been used to reduce the wire
thickness needed.
I did realize after posting that, if the main power rails were organized
as a grid, the whole building could probably be done with 1.25" aluminum bars.
Could power the grid of bars at each of the 4 corners, with maybe some central diagonal bars (which cross and intersect with the central part
of the grid, and an additional square around the perimeter). Each corner supply could drive 512A, and with this layout, no bar or segment should exceed 128A.
BGB <cr88192@gmail.com> posted: --------------------snip----------------------------------
Higher voltage would be needed with DC vs AC, as DC is more subject to
resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.
The military routinely uses 400 Hz to reduce the weight of transformers.
Something like 400 or 480Hz should also work.
On 27/09/2025 10:14, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.
I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.
Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my impression is that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.
I can only speak from personal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC networks in chemical plants.
If you have a high-voltage DC system (like in an electric car) then
using DC motors makes more sense.
These are not "DC motors" in the traditional sense, like brushed DC motors. The motors you use in a car have (roughly) sine wave drive signals, generally 3 phases (but sometimes more). Even motors referred
to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
waveforms are more trapezoidal than sinusoidal.
And whenever you have a frequency inverter, the input to the frequency
inverter is first rectified to DC, then new AC waveforms are generated using PWM-controlled semiconductor switches.
Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
brushed DC motors.
Bigger brushed DC motors, as you say, used to be used in situations
where you needed speed control and the alternative was AC motors driven
at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies. And
as you say, these were replaced by AC motors driven from frequency inverters. Asynchronous motors (or "induction motors") were popular at first, but are not common choices now for most use-cases because
synchronous AC motors give better control and efficiencies. (There are,
of course, many factors to consider - and sometimes asynchronous motors
are still the best choice.)
Or, 2-stage, say:
   960V -> 192V (with 960V to each rack).
   192V -> 12V (with 192V to each server).
Where the second stage drop could use slightly cheaper transistors,
Transistors?
Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.
I'm more used to thyristors in that role.
It's better, perhaps, to refer to "semiconductor switches" as a more
general term.
Thyristors are mostly outdated, and are only used now in very high power situations. Even then, they are not your granddad's thyristors, but
have more control for switching off as well as switching on - perhaps
even using light for the switching rather than electrical signals.
(Those are particularly nice for megavolt DC lines.)
You can happily switch multiple MW of power with a single IGBT module
for a couple of thousand dollars. Or you can use SiC FETs for up to a
few hundred kW but with much faster PWM frequencies and thus better
control.
On 9/27/2025 6:52 AM, David Brown wrote:
On 27/09/2025 10:14, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.
I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.
Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors were the most widespread motors by far up to 25-30 years ago. But my impression is that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.
I can only speak from poersonal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC network in chemical plants.
If you have high-voltage DC system (like in an electric car) then
using DC motors makes more sense.
These are not "DC motors" in the traditional sense, like brushed DC
motors. The motors you use in a car have (roughly) sine wave drive
signals, generally 3 phases (but sometimes more). Even motors
referred to as "Brushless DC motors" - "BLDC" - use AC inputs, though
the waveforms are more trapezoidal than sinusoidal.
Yes.
Typically one needs to generate a 3-phase waveform at the speed they
want to spin the motor at.
I had noted from some experience writing code to spin motors (typically on an MSP430, mostly experimentally) or similar:
 Sine waves give low noise, but less power;
 Square waves are noisier and only work well at low RPM,
   but have higher torque.
 Sawtooth waves seem to work well at higher RPMs.
   Well, sorta, more like sawtooth with alternating sign.
 Square-Root Sine: Intermediate between sine and square.
   Gives torque more like a square wave, but quieter.
   Trapezoid waves are similar to this, but more noise.
Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
and thus generate a lot of heat otherwise).
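Something like this, as a minimal sketch (not the actual MSP430 code; the
256-entry table and 10-bit PWM range are just assumptions for illustration),
is roughly what the square-root-sine table generation looks like, with the
three phases reading the same table 1/3 of a period apart:

  #include <math.h>
  #include <stdint.h>

  #ifndef M_PI
  #define M_PI 3.14159265358979323846
  #endif

  #define STEPS    256     /* table entries per electrical period (assumed) */
  #define PWM_MAX  1023    /* assumed 10-bit PWM timer */

  static uint16_t wave[STEPS];

  /* Keep the sign of sin(), take sqrt() of its magnitude, then scale by the
     requested amplitude (0.0 .. 1.0) and re-center into 0..PWM_MAX. */
  static void build_sqrt_sine(double amplitude)
  {
      for (int i = 0; i < STEPS; i++) {
          double s = sin(2.0 * M_PI * i / STEPS);
          double v = (s >= 0.0 ? 1.0 : -1.0) * sqrt(fabs(s));
          wave[i] = (uint16_t)((v * amplitude * 0.5 + 0.5) * PWM_MAX);
      }
  }

  /* Three phases are the same table read 120 degrees apart. */
  static inline uint16_t phase_a(unsigned i) { return wave[i % STEPS]; }
  static inline uint16_t phase_b(unsigned i) { return wave[(i + STEPS/3) % STEPS]; }
  static inline uint16_t phase_c(unsigned i) { return wave[(i + 2*STEPS/3) % STEPS]; }

Stepping the index faster or slower sets the electrical frequency (and so
the RPM), and rebuilding the table with a smaller amplitude does the
low-RPM current limiting mentioned above.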
And whenever you have a frequency inverter, the input to the frequency
inverter is first rectified to DC, then new AC waveforms are generated using
PWM-controlled semiconductor switches.
Yes:
 Dual-phase: may use a "Dual H-Bridge" configuration
   Where the H-bridge is built using power transistors;
 Three-phase: "Triple Half-Bridge"
   Needs fewer transistors than dual-phase (six switches rather than eight).
It is slightly easier to build these drivers with BJTs or Darlington transistors, but these tend to handle less power and generate more heat,
though they are more fault tolerant.
MOSFETs can handle more power, but one needs to be very careful not to exceed the Gate-Source voltage limit, otherwise they are insta-dead (and will behave as if they are shorted).