• Intel's Software Defined Super Cores

    From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Sep 15 23:54:12 2025
    From Newsgroup: comp.arch

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.
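
    As a purely software-level illustration of "chunks that can be performed
    in parallel" (my own sketch; the proposal would do the split transparently,
    below the ISA):

        /* Two independent "chunks" of a single logical thread, executed on
         * two cores and then joined.  Illustration only. */
        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000
        static double a[N], b[N];

        struct chunk { const double *v; long n; double sum; };

        static void *run_chunk(void *p)            /* one independent chunk */
        {
            struct chunk *c = p;
            double s = 0.0;
            for (long i = 0; i < c->n; i++)
                s += c->v[i];
            c->sum = s;
            return NULL;
        }

        int main(void)
        {
            for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

            struct chunk ca = { a, N, 0.0 }, cb = { b, N, 0.0 };
            pthread_t t;
            pthread_create(&t, NULL, run_chunk, &ca); /* chunk A on another core */
            run_chunk(&cb);                           /* chunk B on this core    */
            pthread_join(t, NULL);

            printf("%f\n", ca.sum + cb.sum);          /* results merged          */
            return 0;
        }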

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Sep 16 00:03:51 2025
    From Newsgroup: comp.arch

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Sep 15 17:19:36 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Two weeks ago, I saw this in Tom's Hardware.

    https://www.tomshardware.com/pc-components/cpus/intel-patents-software-defined-supercore-mimicking-ultra-wide-execution-using-multiple-cores

    But at this point, it is just a patent. While it *might* get included
    in a future product, it seems a long way away, if ever.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Sep 15 17:56:28 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    We would have to somehow tell the system that the program only uses a
    single thread, right? I'm not exactly sure how the synchronization is
    going to work with regard to multi-threaded and/or multi-process programs.

    A single-threaded program runs, then it calls into a function that
    creates a thread. Humm...


    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Can one get something kind of akin to it by clever use of affinity
    masks? But those are not 100% guaranteed, right?
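
    Affinity masks can at least pin the pieces to chosen cores, though they
    only control placement, not the tight cross-core coupling the article
    describes. A minimal sketch, assuming Linux/glibc (pthread_setaffinity_np
    and pthread_attr_setaffinity_np are GNU extensions):

        #define _GNU_SOURCE      /* for CPU_SET, the *affinity_np calls, sched_getcpu */
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>

        static void *worker(void *arg)
        {
            (void)arg;
            printf("worker on CPU %d\n", sched_getcpu());
            return NULL;
        }

        int main(void)
        {
            cpu_set_t set;
            pthread_attr_t attr;
            pthread_t t;

            /* Keep the main "chunk" on core 0. */
            CPU_ZERO(&set);
            CPU_SET(0, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            /* Start the other "chunk" already pinned to core 1. */
            CPU_ZERO(&set);
            CPU_SET(1, &set);
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&t, &attr, worker, NULL);

            pthread_join(t, NULL);
            pthread_attr_destroy(&attr);
            return 0;
        }

    The OS will keep each thread within its mask, but nothing makes the two
    cores behave as one, which is presumably why the patent adds dedicated
    inter-core connections.
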
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Sep 16 10:13:35 2025
    From Newsgroup: comp.arch

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    Sounds like [multiscalar processors](doi:multiscalar processor)


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Sep 16 10:15:04 2025
    From Newsgroup: comp.arch

    Sounds like [multiscalar processors](doi:multiscalar processor)
    ^^^^^^^^^^^^^^^^^^^^^
    10.1145/223982.224451

    [ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Sep 16 15:10:09 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:

    [ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]

    This is sooooo 2010's. Next, you'll be claiming it makes sense to
    think before writing, and where would we be then? Not in the age
    of modern social media, that's for sure.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Sep 16 15:50:38 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue Sep 16 13:01:30 2025
    From Newsgroup: comp.arch

    On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <quadibloc@invalid.invalid> wrote:

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard

    Sounds more like dynamic micro-threading.

    Over the years I've seen a handful of papers about compile-time
    micro-threading: that is, the compiler itself identifies separable
    dependency chains in serial code and rewrites them into deliberately
    threaded code to be executed simultaneously.
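
    As a toy example of the kind of rewrite those papers aim for (my own
    illustration, not taken from any of them): a serial loop that carries two
    separable dependency chains, a running sum and a running minimum, which
    never feed each other, split by hand into two threads:

        /* Original serial form:
         *   for (i = 0; i < N; i++) { s += data[i]; if (data[i] < m) m = data[i]; }
         * The two chains are independent, so they can be separated. */
        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000
        static int data[N];
        static double sum_result;
        static int    min_result;

        static void *sum_chain(void *arg)   /* dependency chain 1 */
        {
            (void)arg;
            double s = 0.0;
            for (long i = 0; i < N; i++)
                s += data[i];
            sum_result = s;
            return NULL;
        }

        static void *min_chain(void *arg)   /* dependency chain 2 */
        {
            (void)arg;
            int m = data[0];
            for (long i = 1; i < N; i++)
                if (data[i] < m)
                    m = data[i];
            min_result = m;
            return NULL;
        }

        int main(void)
        {
            for (long i = 0; i < N; i++)
                data[i] = (int)(i % 97) - 48;

            pthread_t t1, t2;
            pthread_create(&t1, NULL, sum_chain, NULL);
            pthread_create(&t2, NULL, min_chain, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            printf("sum=%f min=%d\n", sum_result, min_result);
            return 0;
        }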

    It is not easy to do under the best of circumstances and I've never
    seen anything about doing it dynamically at run time.

    To make a thread worth rehosting to another core, it would need to be
    (at least) many 10s of instructions in length. To figure this out
    dynamically at run time, it seems like you'd need the decode window to
    be 1000s of instructions and a LOT of "figure-it-out" circuitry.


    YMMV, but to me it doesn't seem worth the effort.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Sep 17 11:54:09 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
    programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

    That's what immediately came to my mind as well; it looks a lot like
    trying some of his ideas about scouting micro-threads, doing work in the
    hope that it will turn out to be useful.

    To me it sounds like it is related to eager execution, except skipping
    further forward into upcoming code.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 17 14:34:09 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 11:54:09 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    When I saw a post about a new way to do OoO, I had thought it
    might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by
    splitting programs into chunks that can be performed in parallel
    on different cores, where the cores are intimately connected in
    order to make this work.

    This is a sound idea, but one may not find enough opportunities to
    use it.

    Although it's called "inverse hyperthreading", this technique
    could be combined with SMT - put the chunks into different threads
    on the same core, rather than on different cores, and then one
    wouldn't need to add extra connections between cores to make it
    work.

    Andy Glew was working on stuff like this 10-15 years ago

    That's what immediately fell to my mind as well, it looks a lot like
    trying some of his ideas about scouting micro-threads, doing work in
    the hope that it will turn out useful.

    To me it sounds like it is related to eager execution, except
    skipping further forward into upcoming code.

    Terje



    The question is: what is the most likely meaning of the fact of patenting?
    IMHO, it means that they explored the idea and decided against going in
    this particular direction in the near- and medium-term future.

    I think that when Intel actually plans to use a particular idea, they
    keep the idea secret for as long as they can and either don't patent it at
    all or apply for a patent after the release of the product.
    I could be wrong about that.

    On the other hand, some of the people named on the patent appear to be
    leading figures in Intel's P-core teams. A year ago, some of them gave
    presentations about the advantages of removing SMT. Removal of SMT and
    this super-core idea can be considered complementary - both push in the
    direction of cores with a smaller number of EU pipes. So maybe the idea
    was seriously considered for Intel products in the mid-term future.
    Anyway, a couple of months ago Tan himself said that Intel is reversing
    the decision to remove SMT, which probably means that all their mid-term
    plans are undergoing significant changes.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Sep 17 13:46:33 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    The question is what is most likely meaning of the fact of patenting?
    IMHO, it means that they explored the idea and decided against going in
    this particular direction in the near and medium-term future.

    I think that when Intel actually plans to use particular idea then they
    keep the idea secret for as long as they can and either don't patent at
    all or apply for patent after release of the product.
    I can be wrong about it.

    That would risk that somebody without patent exchange agreements with
    Intel patents the invention first (whether independently developed or
    due to a leak). Advantages of such a strategy: Companies with patent
    exchange agreements learn even later about the invention, and the
    patent expires at a later date.

    I remember an article about alias prediction (IIRC for executing
    stores before architecturally earlier loads), where the author read a
    patent from Intel, did some measurements on a released Intel CPU,
    and confirmed that they actually implemented what the patent
    described.

    If you find that article, and compare the date when the patent was
    submitted to the date of the release of the processor, you can check
    your theory.

    Some of them 1 year ago gave representations
    about advantages of removal of SMT.

    I did not read any accounts of that that appeared particularly
    knowledgeable. What are the advantages, or where can I read about
    these presentations?

    Removal of SMT and this super-core
    idea can be considered complimentary - both push into direction of
    cores with smaller # of EU pipes.

    What do you mean by that? Narrower cores? In recent years cores seem
    to have exploded in width. From 1995 up to and including 2018 Intel
    produced 3-wide and 4-wide designs (with 4-wide coming IIRC with Sandy
    Bridge in 2011), and since then even the Skymont E-core has grown to
    8-wide, with 26 execution ports and 16-wide retirement. And other CPU manufacturers have also increased the widths of their CPUs.

    It seems that there has been a breakthrough in extracting ILP, making
    wider cores pay off better, a breakthrough in designing wider register
    renamers and making other structures wider, or both.

    Pushing for narrower cores appears implausible to me at this stage.

    Concerning the removal of SMT, I can only guess, but that did not
    appear implausible to me with Intel's hybrid CPUs: They have P-cores
    for fast single-thread performance, and lots of E-cores for
    multi-thread performance. You allocate threads that need
    single-thread performance to P-cores and threads that don't to
    E-cores. If you have even more tasks, i.e., a heavily multi-threaded
    load, do you want to slow down the threads that run on the P-cores by
    switching them to SMT mode, also increasing the already-high power
    consumption of the P-cores, lowering the clock of everything to stay
    within the power limit, and thus possibly the performance? If not,
    you don't need SMT.

    Still, after touting the SMT horn for so long, I don't expect that
    such considerations are the only ones. There must be a significant
    advantage in design complexity or die area when leaving it away
    (contradicting the earlier claim that SMT costs very little).

    Concerning super cores, whatever it is, my guess is that the idea is
    to try to extract even more performance from (as far as software is
    concerned) single-threaded programs than achievable with the wide
    cores of today.

    Anyway, couple of months ago Tan himself said that Intel is reversing
    the decision to remove SMT.

    On the servers, they do not follow the hybrid strategy, for whatever
    reason, so the thoughts above don't apply there. And maybe they found
    that the cloud providers want SMT, in order to sell their customers
    twice as many "CPUs".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Sep 17 13:07:49 2025
    From Newsgroup: comp.arch

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Sep 17 18:53:24 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happened already a few decades ago; they translate
    x86 code into RISC-like microops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Sep 17 18:54:01 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Transmeta tried and failed to do this.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 17 23:00:15 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 18:53:24 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.


    Not really.

    First, translation on the fly does not count.

    Second, even for translation on the fly, only the ancient K6 worked that
    way. Their later chips do a lot of work at the level of macro-ops,
    which in the majority of cases have a one-to-one correspondence to the
    original x86 load-op and load-op-store instructions.

    Actually, I am not 100% sure about Bulldozer and its derivatives, but K7,
    K8 and all generations of Zen use macro-ops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.


    Badly outdated text.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 17 20:19:14 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

    https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
    programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That was tried three decades ago. https://en.wikipedia.org/wiki/Transmeta


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Sep 17 21:33:17 2025
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Sep 18 05:27:15 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic translation.

    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Sep 18 05:31:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this group should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps; at most it confirms the branch
    prediction or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in
    addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
    published 2004 or so (don't let the 2013 date from the reprint fool you)
    Discusses CPU design (not just OoO) using various real CPUs from the
    1990s as examples.

    Processor Microarchitecture -- An Implementation Perspective
    Antonio Gonzalez, Fernando Latorre, Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Ph.D. thesis, U. Toronto
    https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
    published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Sep 18 06:14:30 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

    It definitely was. However, even on modern high-performance OoO cores
    like Apple's M1-M4 P-cores or Qualcomm's Oryon, the performance of
    dynamically translated AMD64 code is usually lower than on comparable
    CPUs from Intel and AMD.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 18 03:39:57 2025
    From Newsgroup: comp.arch

    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).


    Now we have a different situation:
    Moore's law is dying off;
    Scalar CPU performance has hit a plateau;
    And, for many uses, performance is "good enough";
    A lot more software can make use of multi-threading;
    ...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better "performance per watt" metric.


    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.




    One possibility could be that virtual processors don't run on a single
    core, say:
    The logical cores exist more as VMs each running a virtual x86 processor
    core;
    The dynamic translation doesn't JIT translate to a linear program.

    Say:
    Breaks code into traces;
    Each trace uses something akin to CSP mixed with Pi-Calculus;
    Address translation is explicit in the ISA, with specialized ISA level memory-ordering and control-flow primitives.

    For example, there could be special ISA level mechanisms for submitting
    a job to a local job-queue, and pulling a job from the queue.
    Memory accesses could use a special "perform a memory access or branch-subroutine" instruction ("MEMorBSR"), where the MEMorBSR
    operations will try to access memory, either continuing to the next instruction (success) or Branching-to-Subroutine (access failed).

    Where the failure cases could include (but not limited to) TLB miss;
    access fault; memory ordering fault; ...
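
    A rough C model of that MEMorBSR idea (purely a sketch following the
    description above; the toy software TLB, the handler, and the sizes are
    all made up):

        /* The access either succeeds and "falls through", or "branches to a
         * subroutine" (here: a handler call) and is then retried. */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define PAGE_BITS 12
        #define TLB_SIZE  16

        static uint8_t phys_mem[1 << 16];               /* toy physical memory  */
        static struct { uint64_t vpn, pfn; int valid; } tlb[TLB_SIZE];

        static int tlb_lookup(uint64_t vaddr, uint64_t *paddr)
        {
            uint64_t vpn = vaddr >> PAGE_BITS;
            unsigned i   = vpn % TLB_SIZE;
            if (!tlb[i].valid || tlb[i].vpn != vpn)
                return 0;                               /* miss -> BSR          */
            *paddr = (tlb[i].pfn << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1));
            return 1;                                   /* hit  -> fall through */
        }

        static void handle_tlb_miss(uint64_t vaddr)     /* the "subroutine"     */
        {
            uint64_t vpn = vaddr >> PAGE_BITS;
            unsigned i   = vpn % TLB_SIZE;
            tlb[i].vpn   = vpn;                         /* trivial "page table" */
            tlb[i].pfn   = vpn & 0xF;                   /* wrap into toy memory */
            tlb[i].valid = 1;
        }

        /* The MEMorBSR load: try, branch to the handler on failure, retry. */
        static uint8_t mem_or_bsr_load8(uint64_t vaddr)
        {
            uint64_t paddr;
            while (!tlb_lookup(vaddr, &paddr))
                handle_tlb_miss(vaddr);
            return phys_mem[paddr];
        }

        int main(void)
        {
            memset(phys_mem, 0x5A, sizeof phys_mem);
            printf("loaded 0x%02X\n", (unsigned)mem_or_bsr_load8(0x12345));
            return 0;
        }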

    The "memory ordering fault" case could be, when traces are submitted to
    the queue, if they access memory, they are assigned sequence numbers
    based on Load and Store operations. When memory is accessed, the memory
    blocks in the cache could be marked with sequence numbers when read or modified. On access, it could detect if/when memory access have
    out-of-order sequence numbers, and then fall back to special-case
    handling to restore the intended order (reverting any "uncommitted"
    writes, and putting the offending blocks back into the queue to be
    re-run after the preceding blocks have finished).
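
    A similarly rough sketch of the sequence-number check (hypothetical; real
    hardware would keep such tags per cache line and trigger the rollback and
    re-queue itself):

        /* Each trace's memory ops carry a sequence number; a line remembers
         * the newest sequence that has touched it, and an access by an older
         * sequence than a recorded conflicting one signals an ordering fault. */
        #include <stdint.h>
        #include <stdio.h>

        struct line_tag {
            uint64_t last_read_seq;   /* newest sequence that read the line  */
            uint64_t last_write_seq;  /* newest sequence that wrote the line */
        };

        /* 1 = in order; 0 = ordering fault, the logically later trace that
         * already touched the line must be rolled back and re-queued. */
        static int access_check(struct line_tag *t, uint64_t seq, int is_write)
        {
            if (is_write) {
                if (t->last_read_seq > seq || t->last_write_seq > seq)
                    return 0;              /* a "future" trace saw stale data */
                t->last_write_seq = seq;
            } else {
                if (t->last_write_seq > seq)
                    return 0;              /* would read a "future" value     */
                if (t->last_read_seq < seq)
                    t->last_read_seq = seq;
            }
            return 1;
        }

        int main(void)
        {
            struct line_tag tag = { 0, 0 };
            /* trace #7 (logically later) reads the line first ...           */
            printf("read  by seq 7: %s\n", access_check(&tag, 7, 0) ? "ok" : "fault");
            /* ... then trace #3 (logically earlier) writes it: out of order. */
            printf("write by seq 3: %s\n", access_check(&tag, 3, 1) ? "ok" : "fault");
            return 0;
        }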

    Possibly, the caches wouldn't directly commit stores to memory, but
    instead could keep track of a group of cache lines as an "in-flight" transaction. In this case, it could be possible for a "logically older"
    block to see the memory as it was before a more recent transaction, but
    an out-of-order write could be detected via sequence numbers (if seen,
    it would mean a "future" block had run but had essentially read stale data).

    Once a block is fully committed (after all preceding blocks are
    finished) its contents can be written back out to main RAM.
    Could be held in an area of RAM local to the group of cores running the logical core.

    Possibly, such a core might actually operate in multiple address spaces:
    Virtual Memory, via the transaction oriented MEMorBSR mechanism;
    There would likely be an explicit TLB here.
    So, TLB Miss handling could be essentially a runtime call.
    Local Memory:
    Physical Address, small non-externally-visible SRAM;
    Divided into Core-Local and Group-Shared areas;
    Physical Memory:
    External DRAM or similar;
    Resembles more traditional RAM access (via Load/Store Ops);
    Could be used for VM tasks and page-table walks.


    Would likely require significant hardware level support for things like job-queues and synchronization mechanisms.

    One possibility could be that some devices could exist local to a group
    of cores, which then have a synchronous "first come, first serve" access pattern (possibly similar to how my existing core design manages MMIO).

    Possibly it could work by passing fixed-size messages over a bus, with
    each request/response pair to a device being synchronous.


    Possibly the JIT could try to infer possible memory aliasing between
    traces, and enforce sequential ordering if aliasing is likely. This is
    because performing the operations in the correct order the first time is
    likely to be cheaper than detecting an ordering violation and rolling
    back a transaction.

    Whereas proving that traces can't alias is likely to be a much harder
    problem than inferring a probable absence of aliasing. If no order
    violations occur during operation, it can be safely assumed that no
    memory aliasing happened.
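
    One very conservative way the JIT could approximate "probably no aliasing"
    (again just a sketch, not a claim about any real translator): compare the
    address ranges each trace is statically known to touch, and serialize
    when the ranges overlap or are unknown:

        #include <stdint.h>
        #include <stdio.h>

        struct trace_footprint {
            int      known;     /* 0 = accesses not statically resolvable      */
            uint64_t lo, hi;    /* half-open byte range [lo, hi) it may touch  */
        };

        /* Assume aliasing unless both footprints are known and disjoint. */
        static int may_alias(const struct trace_footprint *a,
                             const struct trace_footprint *b)
        {
            if (!a->known || !b->known)
                return 1;
            return a->lo < b->hi && b->lo < a->hi;
        }

        int main(void)
        {
            struct trace_footprint t1 = { 1, 0x1000, 0x1100 };
            struct trace_footprint t2 = { 1, 0x2000, 0x2040 };
            struct trace_footprint t3 = { 1, 0x10f0, 0x1200 };

            /* Disjoint: run in parallel; overlapping: force program order. */
            printf("t1 vs t2: %s\n", may_alias(&t1, &t2) ? "serialize" : "parallel");
            printf("t1 vs t3: %s\n", may_alias(&t1, &t3) ? "serialize" : "parallel");
            return 0;
        }

    Guessing "parallel" wrongly would still be caught by the sequence-number
    check; guessing "serialize" wrongly only costs performance.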

    Maintaining transactions would complicate the cache design though, since
    now there is a problem that the cache line can't be written back or
    evicted until its write-associated sequence is fully committed.

    Might also need to be separate queue spots for "tasks currently being
    worked on" vs "to be done after the current jobs are done". Say, for
    example, if a job needs to be rolled-back and re-run, it would still
    need to come before jobs that are further in the future relative to itself.

    Unlike memory, register ordering is easier to infer statically, at least
    in the absence of dynamic branching.

    Might need to enforce ordering in cases where:
    Dynamic branch occurs and the path can't be followed statically;
    A following trace would depend on a register modified in a preceding trace;
    ...



    As for how viable any of this is, I don't know...

    The VM could be a lot simpler if one assumes a single threaded VM.


    Also unclear is if an ISA could be designed in a way to keep overheads
    low enough (would be a waste if the multi-threaded VM is slower than a
    single threaded VM would have been). But, this would require a lot of
    exotic mechanisms, so dunno...

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 18 03:58:16 2025
    From Newsgroup: comp.arch

    On 9/18/2025 1:14 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

    It definitely was. However, even a modern high-performance OoO cores
    like Apple M1-M4's P-cores or on Qualcomm's Oryon, the performance of dynamically-translated AMD64 code is usually slower than on comparable
    CPUs from Intel and AMD.


    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    Then there is also Perf/$, and if such a CPU can win in both Perf/W and Perf/$, then it can still win even if it is slower, by throwing more
    cores at the problem.


    Though, the possibly interesting idea could be trying for a
    multi-threaded translation rather than a single-threaded translation.
    But, to have any hope, a multi-threaded translation is likely to need
    exotic ISA features; whereas a single-threaded VM could probably run
    mostly OK on normal ARM or RISC-V or similar (well, assuming a world
    where RISC-V addresses some more of its weak areas; but then again, with
    recent proposals for indexed load/store and auto-increment popping up,
    this is starting to look more likely...).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 18 17:51:36 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 05:27:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    BGB <cr88192@gmail.com> writes:
    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an x86
    chip by running *everything* in a firmware level emulator via
    dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic
    translation.


    That's imprecise.
    The first couple of generations of Itanium 2 (McKinley, Madison) still had
    IA-32 hardware; it was gone in Montecito (2006).
    Dynamic translation of application code was available much earlier,
    indeed, but early removal of the [crappy] hardware solution was probably
    considered too risky.



    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton

    As said by just about everybody, BGB's proposal is most similar
    to Transmeta. What was not said by everybody is that a similar approach
    was tried for Arm, by NVidia no less.
    https://en.wikipedia.org/wiki/Project_Denver



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 18 16:16:54 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

    a) Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.
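
    For concreteness, a single C statement like the one below is the kind of
    thing at issue; per the above, K7/K8 can keep it as one LD-OP-ST macro-op
    (one TLB lookup, two data-cache accesses), while a strict load/store RISC
    splits it (the RISC sequence in the comment is schematic, not any
    particular ISA):

        /* Points (a)-(c) in one line of C. */
        void ld_op_st(long *base, long index, long value)
        {
            /* address = base + index*8 + 16, i.e. [Rbase + Rindex<<3 + Disp] */
            base[index + 2] += value;

            /* x86-64:  add qword ptr [rbx + rcx*8 + 16], rax
             *          -> one load-op-store macro-op on K7/K8.
             * Schematic load/store RISC:
             *          ld  t0, [base + (index << 3) + 16]  (or shift+add, then ld)
             *          add t0, t0, value
             *          st  t0, [base + (index << 3) + 16]  (TLB consulted again)
             */
        }

        int main(void)
        {
            long a[4] = { 0, 0, 0, 0 };
            ld_op_st(a, 1, 5);      /* a[3] += 5 */
            return (int)a[3];
        }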

    And do not get me started on the trap/exception/interrupt model.


    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 18 12:33:44 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.
    The term is widely used to mean something that executes internally.
    Beyond that it depends on the specific of each micro-architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB-style design where all the knowledge about
    each instruction (including register ids, immediate data,
    scheduling info, result data, and status) is stored in a single ROB entry,
    then 100 bits sounds pretty small, so I'm guessing that was a 32-bit CPU.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    AMD explains their terminology here, but note that the relationship
    between Macro-Ops and Micro-Ops is micro-architecture specific.

    A Seventh-Generation x86 Microprocessor, 1999
    https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.
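
    Or, paraphrasing that terminology as data structures (illustrative only;
    not AMD's actual encoding or field set):

        #include <stdio.h>

        struct micro_op   { unsigned opcode, src1, src2, dst; }; /* minimum executable entity */
        struct macro_op   { int n_uops; struct micro_op uop[3]; }; /* bundle of 1-3 micro-ops */
        struct micro_line { struct macro_op mop[3]; };            /* 3 macro-ops from the ROM */

        int main(void)
        {
            /* A "simple instruction": one macro-op carrying two micro-ops
             * (e.g. a load feeding an ALU op), as produced by the decoder. */
            struct macro_op m = { 2, { { 1, 0, 0, 7 }, { 2, 7, 3, 7 } } };
            printf("macro-op with %d micro-ops\n", m.n_uops);
            return 0;
        }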

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
    published 2004 or so (don't let the 2013 date from the reprint fool you) Discusses CPU design (not just OoO) using various real CPUs from the
    1990s as example.

    Processor Microarchitecture -- An Implementation Perspective
    Antonio Gonzalez , Fernando Latorre , Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Ph.D. thesis, U. Toronto https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton

    Other micro-architecture related sources since 2000:

    Book
    A Primer on Memory Consistency and Cache Coherence 2nd Ed, 2020
    Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood

    Dissertation
    Complexity and Correctness of a Super-Pipelined Processor, 2005
    Jochen Preiß

    Book
    General-Purpose Graphics Processor Architectures, 2018
    Aamodt, Wai Lun Fung, Rogers

    Book
    Microprocessor Architecture
    From Simple Pipelines to Chip Multiprocessors, 2010
    Jean-Loup Baer

    Book
    Processor Microarchitecture An Implementation Perspective, 2011
    Antonio González, Fernando Latorre, and Grigorios Magklis

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 18 20:26:29 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using the term Rops almost 25 years ago.
    If they used it in early K7 manuals, then it was due to inertia (K6
    manuals copied and pasted without much thought given) and partly because
    of marketing, because RISC was considered cool.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 18 14:42:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using term Rops almost 25 years ago.
    If they used it in early K7 manuals then it was due to inertia (K6
    manuals copy&pasted without much of thought given) and partly because
    of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC ones.
    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.

    They still use the term "micro op" in the Intel and AMD Optimization guides.
    It still means a micro-architecture-specific, internal, simple, discrete
    unit of execution, albeit a more complex one as transistor budgets allow.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 18 14:05:04 2025
    From Newsgroup: comp.arch

    On 9/18/2025 11:16 AM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

    a)Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.

    And do not get me started on the trap/exception/interrupt model.



    Still reminds me of the LOL of some of the old marketing for the TI
    MSP430 trying to pass it off as RISC:
    In practice has variable-length instructions (via @PC+ addressing);
    Has auto-increment addressing modes and similar;
    Most instructions can operate directly on memory;
    Has ability to do Mem/Mem operations;
    ...

    In effect, the MSP430 is closer to the DEC PDP-11 than it is to much of anything in the RISC family.

    Even SuperH, which also branched off from similar origins, had gone over
    to purely 16-bit instructions, and was Load/Store, so more deserving of
    the RISC title (though apparently still a lot more PDP-11 flavored than
    MIPS flavored).


    Their rationale: "But our instruction listing isn't very long, so RISC",
    never mind all of the edge cases they hid in the various addressing
    modes and register combinations.

    But, yeah, following similar logic to what TI was using, one could look
    at something like the Motorola 68000 and be all like, "Yep, looks like
    RISC to me"...


    ...




    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 18 22:56:22 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 14:42:36 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator
    via dynamic translation.
    For AMD, that has happend already a few decades ago; they
    translate x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I
    wrote in <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using the term Rops almost 25 years ago.
    If they used it in early K7 manuals, then it was due to inertia (the K6
    manuals copy&pasted without much thought given) and partly
    because of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC
    ones.

    In 1988. In 1998 - much less so.

    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.


    Of course, they did. Several times.
    Even Zen3 works non-trivially differently from Zen1 and 2.
    If you stopped following in the previous millennium, that's your problem
    rather than theirs.

    They still use the term "micro op" in the Intel and AMD Optimization
    guides. It still means an micro-architecture specific internal
    simple, discrete unit of execution, albeit a more complex one as
    transistor budgets allow.


    By that logic, every CISC is RISC, because at some internal level it
    executes simple operations. Even those with a load-ALU pipeline do the
    load and the ALU op at separate stages.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 19 09:50:32 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.

    A lot more software can make use of multi-threading;

    Possible, but how would it change things?

    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better "performance per watt" metric.

    Evidence?

    Yes, you can run CPUs with Intel P-cores and AMD's non-compact cores
    with higher power limits than what the Apple and Qualcomm chips
    approximately consume (I have not seen proper power consumption
    numbers for these since Anandtech stopped publishing), but you can
    also run Intel CPUs and AMD CPUs at low power limits, with much better "performance per watt". It's just that many buyers of these CPUs care
    about performance, not performance per watt.

    And if you run AMD64 software on your binary translator on CPUs with
    e.g., ARM A64 architecture, the performance per watt is worse than
    when running it on an AMD64 CPU.

    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.

    The approach of a large number of small, slow cores has been tried,
    e.g., in the TILE64, but has not been successful with that core size.
    Other examples are Sun's UltraSparc T1000 and follow-ons, which were
    somewhat more successful, but eventually led to the cancellation of
    SPARC.

    Finally, Intel now offers E-core-only chips for clients (e.g., N100)
    and servers (Sierra Forest), but they have not stopped releasing
    P-Core-only server CPUs. For the desktop, the CPU with the largest
    number of E-Cores (16) also has 8 P-cores, so Intel obviously
    believes that not all desktop applications are embarrassingly
    parallel.

    Intel used to have Xeon Phi CPUs with a higher number of narrower
    cores, but eventually replaced them with Xeon processors that have
    fewer, but more powerful cores.

    AMD offers compact-core-only server CPUs with more cores and less
    cache per core, but otherwise the same microarchitecture, only with a
    much lower clock ceiling. (There is a difference in microarchitecture
    wrt executing AVX-512 instructions on Zen5, but that's minor). AMD
    also offers server CPUs with non-compact cores; interestingly, if we
    compare CPUs with the same numbers of cores, the launch price (at the
    same date) is not that far apart:

                           GHz
    Model      cores  base  boost  cache   TDP    launch    current
    EPYC 9755    128   2.7    4.1  512MB  500W  USD12984    EUR5979
    EPYC 9745    128   2.3    3.7  256MB  400W  USD12141    EUR4192

    Current pricing from <https://geizhals.eu/?cat=cpuamdam4&xf=12099_Server~25_128~596_Turin~596_Turin+Dense>;
    however, the third-cheapest dealer for the 9745 asks for EUR 6129, and
    the cheapest price up to 2025-09-10 has been EUR 6149, so the current
    price difference may be short-lived. The cheapest price for the 9755
    was 4461 on 2025-08-25, and at that time the 9755 was cheaper than the
    9745 (at least as far as the prices seen by the website above are
    concerned).

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more
    expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can
    compensate for that with larger caches.

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU
    require less area? The cache sizes per core are not reduced, and
    their area is not reduced much. The core itself will get smaller, and
    its performance will also get smaller (although by less than the
    core). But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.
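
    As a toy illustration of that argument, with made-up relative
    numbers (nothing below is measured; it only shows the shape of the
    trade-off):

    #include <stdio.h>

    int main(void) {
        /* Assumed relative units: a big core of area 4.0 at performance 1.0,
           and a small core with 1/4 the core area at 1/2 the performance,
           both keeping the same per-core cache (2.0) and roughly the same
           per-core slice of the interconnect (1.0). */
        double big_total   = 4.0 + 2.0 + 1.0, big_perf   = 1.0;
        double small_total = 1.0 + 2.0 + 1.0, small_perf = 0.5;

        printf("area per unit performance: big %.1f, small %.1f\n",
               big_total / big_perf, small_total / small_perf);
        /* Prints 7.0 vs 8.0: the core shrank 4x for only 2x less performance,
           yet area per unit of performance went up, because the cache and the
           interconnect did not shrink with the core. */
        return 0;
    }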

    There is one counterargument to these considerations: The largest
    configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
    E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 19 14:33:44 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips don't have a 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.

    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 19 18:12:38 2025
    From Newsgroup: comp.arch

    On Fri, 19 Sep 2025 09:50:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).


    That particular problem is addressed by grouping smaller cores into
    clusters with a shared L2 cache. It's especially effective for scaling
    when the L2 cache is truly inclusive relative to the underlying L1 caches.
    The price is limited L2 bandwidth as seen by the cores.

    BTW, I didn't find any info about the replacement policy of Intel's
    Sierra Forest L2 caches.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 19 15:05:56 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to

    I am objecting to the claim that uops are RISC-like, and that there is
    a translation to RISC occurring inside the CPU, and (not present here,
    but often also claimed) that therefore there is no longer a difference
    between RISC and non-RISC.

    One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
    RISC architecture is an architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.

    Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
    more importantly, valued reservation stations, and yes, the 118 or
    whatever bits include the operands. I have no idea how the P6 handles
    its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
    (but I think it has a unified scheduler, so that would not work out,
    or maybe I miss something).

    But concerning the discussion at hand: Containing the data is a
    significant deviation from RISC instruction sets, and RISC
    instructions are typically only 32 bits or 16 bits wide.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch
    prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    The major point is that the OoO engine (the part that deals with uops)
    sees a linear sequence of uops it has to process, with nearly all
    actual branch processing (which an architecture has to do) done in a
    part that does not deal with uops. With the advent of uop caches that
    has changed a bit, but many of the CPUs for which the uop=RISC claim
    has been made do not have an uop cache.

    It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    AMD explains there terminology here but note that the relationship
    between Macro-Ops and Micro-Ops is micro-architecture specific.

    A Seventh-Generation x86 Microprocessor, 1999
    https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the
    macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy

    Their "Computer Architecture" book is also revised every few years,
    but their treatment of OoO makes me think that they are not at all
    interested in that part anymore, instead more in, e.g., multiprocessor
    memory subsystems.

    And the fact that we see so few recent books on the topics makes me
    think that many in academia have decided that this is a topic that
    they leave to industry.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 19 16:14:53 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    -------------------------------

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    In the reservation station.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 19 16:23:06 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    --------------------------------------

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    Yes, exactly:: if you have a large number of cores doing a performance of
    X, they will need exactly the same memory BW as a smaller number of cores
    also performing at X.

    In addition, the interconnect has to be at least as good as the small core system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can compensate for that with larger caches.

    Sooner or later, you actually have to read/write main memory.

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU require less area? The cache sizes per core are not reduced, and
    their area is not reduced much.

    A core running at ½ the performance can use a cache that is ¼ the size
    and see the same percentage degradation WRT cache misses (as long as
    main memory is equally latent). TLBs too.
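
    A back-of-the-envelope check of that claim, assuming the common rule
    of thumb that miss rate scales roughly as 1/sqrt(cache size) and the
    same miss penalty for both cores (every number below is invented):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double penalty_ns    = 100.0;  /* assumed DRAM miss penalty (same for both) */
        double big_ns_insn   = 0.25;   /* assumed big core: 0.25 ns per instruction */
        double small_ns_insn = 0.50;   /* half the performance                      */
        double big_miss      = 0.001;  /* assumed misses per instruction            */
        double small_miss    = big_miss * sqrt(4.0);  /* 1/4 cache => 2x miss rate  */

        printf("extra time from misses: big %.0f%%, small %.0f%%\n",
               100.0 * big_miss   * penalty_ns / big_ns_insn,
               100.0 * small_miss * penalty_ns / small_ns_insn);
        return 0;   /* both print 40%: the same percentage degradation              */
    }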

    The core itself will get smaller, and

    12× smaller and 12× lower power

    its performance will also get smaller (although by less than the
    core).

    for ½ the performance

    But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.

    GBOoO Cores tend to be about the size of 512KB of L2

    There is one counterargument to these considerations: The largest configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
    E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 19 11:41:19 2025
    From Newsgroup: comp.arch

    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running has a 105W TDP; I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.


    Probably need to set up a RasPi with a 64-bit OS at some point and see
    how this performs... (wouldn't really be as accurate to compare x86-64
    with 32-bit ARM).


    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that.


    For now, just on the mailing lists, eg: https://lists.riscv.org/g/tech-arch-review/message/368


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 19 12:00:07 2025
    From Newsgroup: comp.arch

    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competetive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that
    direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to
    some likely highly specialized ISA.



    A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
    At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 19 12:38:51 2025
    From Newsgroup: comp.arch

    On 9/19/2025 12:00 PM, BGB wrote:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB  <cr88192@gmail.com>:
    Still sometimes it seems like it is only a matter of time until
    Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now.  Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores.  If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything?  If you cannot make
    single-threaded IA-32 or AMD64 programs run at competetive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
       Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

       Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

       And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to some likely highly specialized ISA.



       A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
      At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )


    Seems I have a little time still...

    Did find this: https://browser.geekbench.com/v4/cpu/compare/2498562?baseline=2792960

    Not an exact match: I think the Eee was running the Atom at a somewhat
    lower clock speed, and this is a Pi3 rather than the original Pi.
    The Pi3 has 4x A53 cores.


    But, yeah, they are roughly matched on single thread performance when
    the Atom has a clock-speed advantage.

    Though, this seems to imply that they are more just "comparable" on the performance front, rather than Atom being significantly slower...


    Would need to try to dig-out the Eee and re-test, assuming it still
    works/etc.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Sep 19 17:48:52 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <2015Dec6.152525@mips.complang.tuwien.ac.at>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").
    I don't know what you are objecting to

    I am objecting to the claim that uops are RISC-like, and that there is
    a translation to RISC occuring inside the CPU, and (not present here,
    but often also claimed) that therefore there is no longer a difference between RISC and non-RISC.

    Ok. I disagree with this because I have a different view of the
    changes in moving from CISC to RISC (which I'll describe below).

    One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
    RISC architecture is an architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.

    Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
    more importantly valued reservation stations, and yes, the 118 or
    whatever bits include the operands. I have no idea how the P6 handles
    its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
    (but I think it has a unified scheduler, so that would not work out,
    or maybe I miss something).

    But concerning the discussion at hand: Containing the data is a
    significant deviation from RISC instruction sets, and RISC
    instructions are typically only 32 bits or 16 bits wide.

    Yes, and those 32-bit external ISA instructions are mapped into uOps internally. All that is different here is the difficulty for decode.

    I see the difference between CISC and RISC as in the micro-architecture, changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    An important consequence of the sequential design is that
    most of this machine is sitting idle most of the time.

    One can take an Alpha ISA and implement it with a microcoded sequencer
    but that should not be called RISC so the distinction must lie elsewhere.

    RISC changes that design to one like a multi-threaded program with
    messages passing between them called uOps, where the dynamic state
    of each instruction is mostly carried with the uOp message,
    and each thread does something very simple and passes the uOp on.
    Where global resources are required, they are temporarily dynamically
    allocated to the uOp by the various threads, carried with the uOp,
    and returned later when the uOp message is passed to the Retire thread.
    The Retire thread is the only one which updates the visible global state.

    As I see it, this Multiple Simple Thread Message Passing Architecture
    (MST-MPA) is the essence of the change RISC invoked, and any
    micro-architecture that follows it is in the risc design style.
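
    A minimal C sketch of that picture (stage names and uOp fields are
    invented; only the shape of the message passing matters): each stage
    does one simple thing to a uOp message and hands it on, and only the
    retire stage touches the architecturally visible state.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {              /* the "uOp message"                          */
        uint32_t seq;             /* program-order tag, used at retire          */
        uint8_t  dst;             /* architectural destination register         */
        uint64_t a, b;            /* dynamic state carried with the message     */
        uint64_t result;
    } uop_msg;

    static uint64_t arch_regs[32];                /* the visible global state   */

    static void execute_stage(uop_msg *u) { u->result = u->a + u->b; }
    static void retire_stage (uop_msg *u) { arch_regs[u->dst] = u->result; }
                                  /* only retire updates architectural state    */
    int main(void) {
        uop_msg u = { .seq = 1, .dst = 3, .a = 40, .b = 2 };
        execute_stage(&u);        /* message handed from one simple stage...    */
        retire_stage(&u);         /* ...to the next                             */
        printf("r3 = %llu\n", (unsigned long long)arch_regs[3]);
        return 0;
    }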

    The RISC design guidelines described by various papers, rather than
    go/no-go decisions, are mostly engineering compromises for consideration
    of things which would make an MST-MPA more expensive to implement or
    otherwise interfere with maximizing the active concurrency of all threads. Whether the register file has 8, 16, or 32 entries affects the frequency
    of stalls but doesn't change whether it is implemented as MST-MPA and
    therefore entitled to be called "RISC".

    This is why I think it would have been possible to build a risc-style
    PDP-11 in 1975 TTL, or a VAX if they had just left the instructions of
    the same complexity as PDP-11 ISA (53 opcodes, max one immediate,
    max one mem op per instruction).

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch
    prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.
    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    The major point is that the OoO engine (the part that deals with uops)
    sees a linear sequence of uops it has to process, with nearly all
    actual branch processing (which an architecture has to do) done in a
    part that does not deal with uops. With the advent of uop caches that
    has changed a bit, but many of the CPUs for which the uop=RISC claim
    has been made do not have an uop cache.

    There are multiple places that can generate next RIP addresses:
    - The incremented RIP for the current instruction
    - Branch Prediction can redirect Fetch
    - Decode can pick off unconditional branches and immediately redirect Fetch.
    - Decode could also notice if the branch predictor made an erroneous
    decision and redirect Fetch.
    - Register Read might forward a "JMP reg" address directly to Fetch.
    - The Branch Unit BRU has a uOp scheduler to wait for in-flight registers
    or condition codes and then processes all branch & jump uOps and
    possibly redirects Fetch, and updates Branch Prediction.
    - uOp Retire detects exceptions and can force a Fetch redirect.
    - Interrupts can redirect Fetch.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 07:56:49 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 19 Sep 2025 09:50:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more
    expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).


    That particualr problem is addressed by grouping smaller cores into
    clusters with shared L2 cache. It's especially effective for scaling
    when L2 cache is true inclusive relatively to underlying L1 caches.
    The price is limited L2 bandwidth as seen by the cores.

    The other price is longer L2 latency; on a Core Ultra 9 285K:

                   L2       L3      DRAM
    Skymont      4.24ns  14.92ns   ~180ns
    Lion Cove    2.98ns  14.75ns  99.52ns

    Numbers from <https://chipsandcheese.com/p/analyzing-lion-coves-memory-subsystem> <https://chipsandcheese.com/p/skymont-in-desktop-form-atom-unleashed>

    Estimated from the graph where I could not find numbers.

    I wonder what slows down the DRAM access of Skymont on the same chip
    so much when the L3 latency is so close.

    Yes, organizing the interconnect in a hierarchical way can help reduce
    the increase in interconnect cost, but I expect that there is a reason
    why Intel did not do that for its server CPUs with P-Cores, by e.g.,
    forming clusters of 4, and then continuing with the ring; instead,
    they opted for a grid interconnect.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 08:33:37 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    -------------------------------

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the
    macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    In the reservation station.

    Ok, so what I currently imagine is this: The macro-op contains tags or
    (for non-valued reservation stations) register numbers for the
    intermediate results. It is sent to the affected reservation
    stations, which pick the parts relevant to them out of the macro-op,
    thus forming micro-ops. If one of the reservation stations is full,
    I expect that the macro-op is kept back in the front end. The ROB
    does not need to wait for each micro-op, but only for the last one in
    the macro-op (if the micro-ops have one last one, which they have in
    the case of load-op instructions (the op is last) and RMW instructions
    (the W is last)).
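
    Expressed as data structures, the guess above might look roughly
    like this (all names and field layouts invented): the front end
    emits one macro-op, and each affected reservation station picks out
    only the fields it needs, which is what then constitutes a micro-op.

    #include <stdint.h>

    typedef struct {                          /* one macro-op (invented layout)  */
        uint8_t  mem_base, mem_index, mem_scale;  int32_t disp;   /* LSU part    */
        uint8_t  alu_fn, alu_src;                                 /* ALU part    */
        uint8_t  load_tag, result_tag;  /* tags naming the intermediate results  */
    } macro_op;

    typedef struct { uint8_t base, index, scale; int32_t disp; uint8_t dst_tag; } lsu_uop;
    typedef struct { uint8_t fn, src_reg, src_tag, dst_tag; } alu_uop;

    /* Each reservation station picks the parts relevant to it. */
    static lsu_uop pick_lsu(macro_op m) {
        lsu_uop u = { m.mem_base, m.mem_index, m.mem_scale, m.disp, m.load_tag };
        return u;
    }
    static alu_uop pick_alu(macro_op m) {
        alu_uop u = { m.alu_fn, m.alu_src, m.load_tag, m.result_tag };
        return u;
    }

    int main(void) {
        macro_op m = { 1, 2, 2, 16, /*ADD*/ 0, 3, /*tags*/ 10, 11 };
        lsu_uop l = pick_lsu(m);
        alu_uop a = pick_alu(m);
        return l.dst_tag == a.src_tag ? 0 : 1;  /* the load's result feeds the ALU uop */
    }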

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 08:47:10 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    --------------------------------------

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    Yes, exactly:: if you have a large number of cores doing a performance of
    X, they will need exactly the same memory BW as a smaller number of cores also performing at X.

    The memory subsystem plays a big role, however. The caches filter
    away many of the main-memory accesses.

    Sooner or later, you actually have to read/write main memory.

    In general, no. If the caches are large enough, the code and data can
    be loaded from the disk or the network into the cache, processed
    there, and then sent out to the disk or network without ever accessing
    DRAM.

    And that's not just a theoretical thing: There are network packet
    routers with Xeon-D CPUs where the network interfaces deliver the
    packets into L3 cache, the program looks at each packet, decides where
    it is sent, and performs the appropriate action, all within the
    caches; the end result is then consumed by the network interfaces
    again. There are about 70ns per packet, so there is no time for a DRAM
    access and its latency (I expect that there will be some loading from
    DRAM when an unusual route is needed that is not cached, but for the
    majority of packets, there is no time for that).
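
    For illustration, a ~70ns budget is about what minimum-size frames
    on a 10 Gb/s port work out to (the link speed is an assumption
    here, not something stated above):

    #include <stdio.h>

    int main(void) {
        double line_rate_bps      = 10e9;            /* assumed 10 Gb/s port     */
        double bits_per_min_frame = (64 + 20) * 8.0; /* min frame + preamble/IFG */
        /* Prints 67.2 ns, in line with the ~70ns figure above and well below
           a DRAM access of the order of 100ns or more. */
        printf("%.1f ns per packet\n", 1e9 * bits_per_min_frame / line_rate_bps);
        return 0;
    }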

    On the more theoretical side, that's the main fallacy in "Hitting the
    Memory Wall" (1995). I have written a critique of that paper in 2001
    and announced it here <9fst8d$60u$1@news.tuwien.ac.at>; you can find
    it on <http://www.complang.tuwien.ac.at/anton/memory-wall.html>.
    Interestingly, I have now found a retrospective paper about this from
    2004 by McKee: <http://svmoore.pbworks.com/w/file/fetch/59055930/p162-mckee.pdf>; she
    mentions comp.arch several times, but apparently missed my posting or
    did not find it relevant enough to address it in her retrospective
    (given the large number of other reactions that the paper received,
    the latter would not be surprising).

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU
    require less area? The cache sizes per core are not reduced, and
    their area is not reduced much.

    A core running at ½ the performance can use a cache that is ¼ the size
    and see the same percentage degradation WRT cache misses (as long as
    main memory is equally latent). TLBs too.

    Yes (as long as the latency does not rise), but if they do, the number
    of memory accesses filtered out by the caches decreases, and the DRAM bandwidth required by the core increases beyond the 1/2 value. So now
    you have twice the number of cores, each with more than 1/2 memory
    bandwidth requirement. So you need to increase the memory bandwidth,
    or you will lose performance; the mechanism for the lower performance
    is that the latency of the memory accesses increases from having to
    wait for other memory accesses to be served.

    The alternative I outlined is to use same-sized caches (per core), so
    the caches filter just as well as with the big cores, and the 2n
    smaller cores need the same memory bandwidth as the n larger cores.

    12× smaller and 12× lower power
    for 1/2 the performance

    In a Samsung Exynos9820, the Cortex-A75 has 3-4 times the size of a
    Cortex-A55; power and performance depend on where on the
    voltage-frequency curve we use these cores; they have similar
    performance/watt ranges, and at the same performance/watt, the
    performance of the A75 is 3-4 times higher than that of the
    A55. <2024Jan24.084731@mips.complang.tuwien.ac.at> <2024Jan24.225412@mips.complang.tuwien.ac.at>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 10:25:40 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    With 1700MHz.

    Data from <https://www.complang.tuwien.ac.at/franz/latex-bench>;
    numbers are times in seconds:

    Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
    Core i5-1135G7, 4134MHz, 8MB L3, Ubuntu 21.04 (64-bit) 0.279

    The Core i5-1135G7 is limited to 12W on the machine where I measured
    this; the Core i5-1135G7 is 8.9 times faster than the 1896MHz Cortex
    A53, so the 1700MHz A53 is probably about 9.9 times slower than the
    Core i5-1135G7. That's for a single core. I cannot measure how far
    the MT6752 clocks down under multi-core/multi-thread load. I did it
    for the Core i5-1135G7:

    wget http://www.complang.tuwien.ac.at/anton/latex-bench/bench.tex
    wget http://www.complang.tuwien.ac.at/anton/latex-bench/types.bib
    for i in 0 1 2 3 4 5 6 7; do mkdir $i; cp bench.tex types.bib $i; done
    for i in 0 1 2 3 4 5 6 7; do (cd $i; taskset -c $i sh -c "latex bench >/dev/null; bibtex bench >/dev/null; while true; do /bin/time -f\"%U\" latex bench >/dev/null; done" &); done

    When using all 8 threads, the CPU clocked itself down to 2100MHz (the
    base frequency for TDP=12W is 900MHz), and each LaTeX benchmark ran in 0.93s-0.99s user time (8 in parallel). I.e., 8*0.12s per invocation.

    I also measured it with only 4 processes, one for each core. The
    clock was 2400-2500MHz, the times 0.51s-0.53s, i.e., 4*0.13s per
    invocation. The throughput advantage of SMT is very small here.

    Anyway, the Core i5-1135G7 gets one run every 0.12s from 12W, while
    the MT6752 gets one run every 0.35s (2.488*1896/1700/8, almost three
    times slower) from 7W, even if we can assume it can do 1700MHz on all
    cores while staying in the 7W. In any case, the bottom line is that
    the Core i5-1135G7 at 12W is more power-efficient than the MT6752, and
    that's with the A53 running the benchmark native, not in emulation.
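
    In joules per run, using the numbers above (and, as above, assuming
    the MT6752 really can sustain all eight cores at 1700MHz within its
    7W):

    #include <stdio.h>

    int main(void) {
        double i5_s_per_run = 0.12, i5_watts = 12.0; /* Core i5-1135G7, 8 threads  */
        double mt_s_per_run = 0.35, mt_watts =  7.0; /* MT6752 estimate from above */
        printf("i5-1135G7: %.2f J/run\n", i5_s_per_run * i5_watts); /* 1.44 J      */
        printf("MT6752:    %.2f J/run\n", mt_s_per_run * mt_watts); /* 2.45 J      */
        return 0;
    }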

    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that.


    For now, just on the mailing lists, eg: https://lists.riscv.org/g/tech-arch-review/message/368

    Interesting, thanks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 11:48:00 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that direction.

    What direction?

    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...

    That's probably a software problem. Different Eee PC models have
    different CPUs, Celeron M @571MHz, 900MHz, or 630MHz, Atoms with
    1330-1860MHz, or AMD C-50 or E350. All of them are quite a bit faster
    than the 700MHz ARM11. While I don't have a Raspi1 result on https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
    result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
    faster than the 700MHz ARM11), and also some CPUs similar to those
    used in the Eee PC; numbers are times in seconds:

    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    - Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89
    - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
    - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216

    So all of these CPUs clearly beat the one in the Raspi3, which I
    expect to be clearly faster than the ARM11.

    Now imagine running the software that made the Eee PC so slow with
    dynamic translation on a Raspi1. How slow would that be?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 12:01:39 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,

    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.

    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    One can take an Alpha ISA and implement it with a microcoded sequencer
    but that should not be called RISC

    Alpha is a RISC architecture. So this hypothetical implementation
    would certainly be an implementation of a RISC architecture.

    RISC changes that design to one like a multi-threaded program with
    messages passing between them called uOps, where the dynamic state
    of each instruction is mostly carried with the uOp message,
    and each thread does something very simple and passes the uOp on.
    Where global resources are required, they are temporarily dynamically allocated to the uOp by the various threads, carried with the uOp,
    and returned later when the uOp message is passed to the Retire thread.
    The Retire thread is the only one which updates the visible global state.

    This does not sound like RISC vs. non-RISC at all, but like OoO microarchitecture, and the contrast would be an in-order execution microarchitecture. Both RISCs and non-RISCs can make use of OoO microarchitectures, and have done so.

    The RISC design guidelines described by various papers, rather than
    go/no-go decisions, are mostly engineering compromises for consideration
    of things which would make an MST-MPA more expensive to implement or otherwise interfere with maximizing the active concurrency of all threads.

    The interesting aspect is that RISCs are easier to implement in simple pipelines like the ones of early ARM, HPPA, MIPS and SPARC
    implementations, but can also be implemented as in-order superscalar
    or OoO superscalar microarchitectures; you can also implement one as a sequentially-executed microcode engine. Wolfgang Kleinert implemented
    a microcoded RISC in the 1980s, but I think that it was pipelined.

    The advantages from the instruction set diminish with the more complex implementation techniques, and there are a number of instruction set
    design decisions in early RISCs that turned out to be not so great and
    that were eliminated in later RISCs (if not from the start), most
    notably delayed branches, but many of the recent instruction sets (ARM
    A64, RISC-V) take many of the same design decisions as the RISC
    architectures of the 1980s (load/store, register architecture, etc.,
    see John Mashey's criteria and recent discussions about this topic),
    whereas many non-RISCs deviate from this design style.

    This is why I think it would have been possible to build a risc-style
    PDP-11 in 1975 TTL, or a VAX if they had just left the instructions of
    the same complexity as PDP-11 ISA (53 opcodes, max one immediate,
    max one mem op per instruction).

    The PDP-11 instruction set is not RISC, and you paint a picture that
    is too rosy: It has up to two mem ops per instruction, and IIRC even memory-indirect addressing modes. Not a problem for the
    physically-addressed first implementations, nasty as soon as you add
    virtual memory.

    Implementing a pipelined PDP-11 (like the 486 was
    for IA-32) would have been quite a bit harder than for the
    486 (admittedly the 486 has to deal with 16-bit modes and other legacy features, so it's not the easiest target, either).

    For the VAX I would go for a RISC instead of a cleaned-up IA-32-like instruction set, and then implement pipelining. I would rather put
    the effort in implementing compressed instructions rather than
    load-and-op or RMW instructions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Sep 20 13:10:49 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> wrote:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.

    A single core in the Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz)
    benchmarks to 4453.45 DMIPS (Dhrystone MIPS). A single core in my desktop
    benchmarks to about 50000 DMIPS. Dhrystone contains string operations
    which benefit from SSE/AVX, but I would expect that on media loads the
    speed ratio would be even more favourable to the desktop core. On jumpy
    code the ratio is probably lower. The 1GHz RISCV in the Milkv-Duo
    benchmarks to 1472 DMIPS.
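    A rough sketch of the arithmetic behind such DMIPS figures, assuming the
    conventional VAX 11/780 reference of 1757 Dhrystones/s (the inputs below
    are illustrative, not measured values):

    /* DMIPS = Dhrystones/second normalized to the VAX 11/780's
       1757 Dhrystones/s; DMIPS/MHz then normalizes for clock speed. */
    #include <stdio.h>

    int main(void)
    {
        double dhry_per_sec = 7.82e6;  /* e.g. ~7.8M Dhrystones/s      */
        double mhz          = 1200.0;  /* e.g. an H618 core at 1.2 GHz */
        double dmips        = dhry_per_sec / 1757.0;

        printf("%.0f DMIPS, %.2f DMIPS/MHz\n", dmips, dmips / mhz);
        return 0;
    }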

    It is hard to compare performance per watt: the Orange Pi Zero 3 has a low
    power draw (on the order of 100 mA from a 5V USB charger with one core
    active) and it is not clear how it is distributed between the CPUs and the
    Ethernet interface. The RISCV in the Milkv-Duo has an even lower power
    draw. OTOH desktop cores normally seem to run at a fraction of rated power
    too (but I have no way to directly measure CPU power draw).

    Of course, there is a catch: the desktop CPU is made on a more advanced
    process than the small processors. So it is hard to separate the effects
    of the architecture from those of the process.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Sep 20 19:32:17 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,

    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.

    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered as RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    Maybe relevant:

    Performance optimizers writing asm regularly hit that 1 IPC on the 486
    and (with more difficulty) 2 IPC on the Pentium.

    When we did get there, the final performance was typically 3X compiled C
    code.

    That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
    the PPro and later OoO CPUs.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 20 17:38:19 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,

    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent machines view, and from Clocks Per Instruction to Instructions Per Clock.
    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
    386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many
    successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered as RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    Maybe relevant:

    Performance optimizers writing asm regularly hit that 1 IPC on the 486
    and (with more difficulty) 2 IPC on the Pentium.

    When we did get there, the final performance was typically 3X compiled C code.

    That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
    the PPro and later OoO CPUs.

    And then came back with SIMD, I presume? :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Sep 20 22:01:27 2025
    From Newsgroup: comp.arch

    On Sat, 20 Sep 2025 07:56:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Yes, organizing the interconnect in a hierarchical way can help reduce
    the increase in interconnect cost, but I expect that there is a reason
    why Intel did not do that for its server CPUs with P-Cores, by e.g.,
    forming clusters of 4, and then continuing with the ring; instead,
    they opted for a grid interconnect.

    - anton


    I don't know for sure, but I would imagine that the reason is that their
    server CPUs with P-cores have the same design for low-to-mid-end "cloud"
    models and for high-end "enterprise" models. High-end models have OLTP
    and similar enterprise workloads as a rather important market. A flatter
    LLC is better for OLTP/enterprise than a dozen or two separate L3
    caches. Besides, their current L2 caches are rather big, so if they
    made those separate L3s truly exclusive, which is optimal for reducing
    cc traffic, then there would be a rather big waste of total cache
    capacity.

    An alternative is to leave the LLC intact and instead make the L2s shared
    by pairs of cores. That is unacceptable because of yet another market
    addressed by the same Xeon line - computation/HPC, where being
    limited by L2 bandwidth is not rare even now. With a shared L2 it would
    become very common.

    3 different uncore designs for 3 different markets could solve that
    nicely, but of course in Intel's current financial situation that
    is unthinkable. Probably even the current arrangement with 3 Xeon lines
    (Xeon-E = desktop chips with E-cores fused off, Sierra Forest = plenty
    of Crestmont cores, and "normal" Xeons currently represented by Granite
    Rapids) could be unsustainable.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Sep 20 21:14:23 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I see the difference between CISC and RISC as in the micro-architecture,
    But the microarchitecture is not an architectural criterion.

    changing from a single sequential state machine view to multiple concurrent
    machines view, and from Clocks Per Instruction to Instructions Per Clock.
    People changed from talking CPI to IPC when CPI started to go below 1.
    That's mainly a distinction between single-issue and superscalar CPUs.

    The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX, 386, 486 and Pentium, is like a single threaded program which
    operates sequentially on a single global set of state variables.
    While there is some variation and fuzziness around the edges,
    the heart of each of these are single sequential execution engines.

    The same holds true for the MIPS R2000, the ARM1/2 (and probably many
    successors), probably early SPARCs and early HPPA CPUs, all of which
    are considered as RISCs. Documents about them also talk about CPI.

    And the 486 is already pipelined and can perform straight-line code at
    1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
    straight-line code).

    Maybe relevant:

    Performance optimizers writing asm regularly hit that 1 IPC on the 486
    and (with more difficulty) 2 IPC on the Pentium.

    When we did get there, the final performance was typically 3X compiled C
    code.

    That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
    the PPro and later OoO CPUs.

    And then came back with SIMD, I presume? :-)

    Sure!

    I typically got 3X SIMD speedup from 4-way processing, years before any compilers were able to autovectorize to again partly close the gap.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Sep 20 16:10:40 2025
    From Newsgroup: comp.arch

    On 9/20/2025 8:10 AM, Waldek Hebisch wrote:
    BGB <cr88192@gmail.com> wrote:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with
    Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.

    Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks to 4453.45 DMIPS (dhrystone MIPS). Single core in my desktop bencharks to about 50000 DMIPS. Dhrystone contain string operations which benefit
    from SSE/AVX, but I would expect that on media load speed ratio would
    be even more favourable to desktop core. On jumpy code ratio is probably lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

    It is hard to compare performance per watt: Orange Pi Zero 3 has low
    power draw (of order 100 mA from 5V USB charger with one core active) and
    it is not clear how it is distributed between CPU-s and Etherent interface. RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
    normally seem to run at at fraction of rated power too (but I have
    no way to directly measure CPU power draw).

    Of course, there is a catch: desktop CPU is made on more advanced
    process than small processors. So it is hard to separate effects
    from architecture and from the process.


    I had noted before that when I compiled Dhrystone on my Ryzen using
    MSVC, it is around 10M, or 5691 DMIPS, or around 1.53 DMIPS/MHz.

    Curiously, the score is around 4x higher (around 40M) if Dhrystone is
    compiled with GCC (and around 2.5x with Clang).

    For most other things, the performance scores seem closer.

    I don't really trust GCC's and Clang's Dhrystone scores as they seem
    basically out-of-line with most other things I can measure.



    Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
    If assuming MSVC as the reference, this would imply (after normalizing
    for clock-speeds) that the Ryzen only gets around 50% more IPC.



    I noted when compiling my BJX2 emulator:
    My Ryzen can emulate it at roughly 70MHz;
    My cell-phone can manage it at roughly 30MHz.

    This isn't *that* much larger than the difference in CPU clock speeds.


    It is like, I seemingly live in a world where a lot of my own benchmark attempts tend to be largely correlated with the relative difference in
    clock speeds and similar.


    Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
    to perform somewhat below that curve.


    Though, in the case of the laptop, it may be a case of not getting all
    that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
    speeds, and on that laptop, its memcpy speeds are kinda crap if compared
    with CPU clock-speed).

    Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
    Thing ran Quake and Quake 2 pretty OK, but not much else.


    Though, if running my emulator on the laptop, it is more back on the
    curve of relative clock-speed, rather than on the
    relative-memory-bandwidth curve.

    It seems both my neural-net stuff and most of my data compression stuff,
    more follow the memory bandwidth curve (though, for the laptop, it seems
    NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).


    Well, and then my BJX2 core seems to punch slightly outside its weight
    class (MHz wise) by having disproportionately high memory bandwidth.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Sep 20 22:01:48 2025
    From Newsgroup: comp.arch

    On 9/20/2025 6:48 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that
    direction.

    What direction?


    In some direction where emulating x86 on in-order cores was preferable to having x86 in hardware...

    May or may not be "extreme budget".



    Though, I am writing this after having to battle for a while to get
    "boot magic" out of a Dell OptiPlex that I got on Amazon for $80.

    Turned out the UEFI BIOS was not installed correctly on the PC, which
    was effectively "utterly helpless" without it.

    Had to use a Dell tool to make an installer image on a USB thumb-drive,
    to get a bootable BIOS to configure the thing into a form where it could actually boot (where, it could then apparently install the BIOS config
    UI from the USB drive). Apparently no support for Legacy Boot; the option
    was listed but just sort of grayed out and could not be selected
    (apparently no TPM either, so can't run Win 11).

    But, for $80, could get something with a Core i3 and a 500GB HDD.
    Case was only really designed to handle a 2.5" drive, no space to fit a
    3.5" HDD.


    Going much cheaper, it apparently crosses from HDD into "eMMC Flash" territory, but "64GB eMMC Flash" was maybe a little too budget.

    There were also some options with M.2, but I wanted SATA. At least, in
    theory, with SATA one can swap HDDs if needed, but this is seemingly
    hindered if the firmware is so limited as to be rendered helpless if it
    can't load it from the HDD.


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance >>>> on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...

    That's probably a software problem. Different Eee PC models have
    different CPUs, Celeron M @571Mhz, 900MHz, or 630MHz, Atoms with 1330-1860Mhz, or AMD C-50 or E350. All of them are quite a bit faster
    than the 700Mhz ARM11. While I don't have a Raspi1 result on https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
    result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
    faster than the 700Mhz ARM11), and also some CPUs similar to those
    used in the Eee PC; numbers are times in seconds:

    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    - Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89
    - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
    - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216

    So all of these CPUs clearly beat the one in the Raspi3, which I
    expect to be clearly faster than the ARM11.


    IIRC, I was running Debian on the Eee (IIRC because the Xandros it came
    with was kinda useless).

    The one I have is one of the 701 variants (would need to find it
    again to know the model). Looking online, it was probably one of the underclocked Celeron models though.


    Not sure how fast (or not fast) it was, but it was basically about
    enough to run Quake and Quake 2 in 640x480, but was hard pressed to do
    much more than this (and be playable).

    Trying to use Firefox or similar on it was just kinda painful.


    Now imagine running the software that made the Eee PC so slow with
    dynamic translation on a Raspi1. How slow would that be?


    Seemingly the RasPi could run Quake OK in 800x600 though...
    And, also did well working with CRAM video.


    By other subjective measures, at least the GUI on the RasPi didn't
    behave like molasses.

    So, in any case, a better user experience at least (with some
    uncertainty as to the actual speed).


    Granted, it might have been useful to time GCC builds or similar
    for a more objective measure; I would need to find both.


    Though, at least, an emulator would need to be faster than DOSBox, as
    DOSBox on RasPi tends to be too slow to even really run Doom or similar.

    My cellphone at least gave a slightly better experience running DOSBox
    (well, except that DOSBox and Termux on Android occasionally forget all
    of their local storage and get reverted to their default contents).

    RasPi+DOSBox can at least seemingly run Windows 3.11 and similar though.

    Though, AFAIK DOSBox on ARM is running purely as an interpreter.


    I remember though that one time I did try doing custom code generation
    on the RasPi, and performance was terrible. At the time it seemed like
    there was some "secret sauce" that GCC had to not get terrible performance.

    Though, IIRC, this was a fork where I had tried to modify BGBCC's
    SuperH backend to be able to target Thumb2.



    Or, seeming informal/subjective ranking (mostly from memory):

    Eee (CPU = something slow):
    Quake 2, 640x400, OK-ish
    Quake 3, N/A, didn't work
    (No memcpy score or formal benchmarks)

    Laptop from 2003 (1.4GHz Athlon, of some variant):
    Quake 1/2: 1024x768, runs well.
    (1024x768 is max resolution of LCD).
    Quake 3: Also runs well.
    As did GLQuake and Quake2 in OpenGL.
    Half-Life runs well.
    Half-Life 2, ran but poorly.
    Gets around 400MB/s in a memcpy benchmark.
    DDR1 100 MHz (or, DDR-200)
    Notably lower than theoretical bandwidth.
    (No values for LZ4 or CRAM tests IIRC)

    RasPi 1 (700 MHz ARM11):
    Quake 800x600 runs OK.
    Quake 3: Ran, but poorly.
    Gets around 1.2 GB/sec in memcpy.
    Around 300 MB/s LZ4 decode
    Around 400 Mpix/sec in CRAM decode.

    RasPi 3 (1400 MHz 4x A53):
    Quake 1/2/3 and GLQuake and Q3A run well.
    Gets around 1.6 GB/sec in memcpy.
    Around 500 MB/s LZ4 decode
    Around 700 Mpix/sec in CRAM decode.

    Laptop from 2009 (2.1 GHz Core 2, 2 cores):
    Quake 1/2 and Half-Life are 60 fps at max resolution (1440x900).
    In SW rendering only.
    It was a very good option if you were OK with software rendering.
    Quake 3: Around 20 fps.
    GLQuake and Quake3 perform like dog crap.
    GPU: Intel GMA X3100
    Half-Life 2: Also very poor.
    Minecraft ran, but unplayable.
    Even on lowest draw distance.
    Doom 3, started up at least...
    Severe graphical glitches (lighting didn't work correctly)
    Dead slow.
    Around 2.4 GB/sec in memcpy.
    Around 2.0 GB/s in LZ4
    Around 1500 Mpix/sec in CRAM decode.
    Performs well in CPU based tasks.
    OpenGL via Software rasterization almost as fast as the GPU.

    Current PC (Ryzen 2700X, 3.7GHz, 8C16T)
    No issues running any of these games.
    Memcpy: 3.6 GB/sec.
    DDR4-2133
    Around 3.2 GB/sec in LZ4
    Around 2000 Mpix/sec in CRAM decode.


    As can be noted:
    memcpy tests tend to measure lower than RAM bandwidth.
    CRAM decode often tends to exceed memcpy.
    My memcpy and LZ4 tests are single threaded.
    Multi-threading can often give higher total bandwidth.
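    For reference, a minimal single-threaded memcpy bandwidth test in the same
    spirit (the buffer size, repeat count, and use of clock_gettime() here are
    assumptions, not the actual test behind the numbers above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        size_t size = 64u << 20;            /* 64 MiB, larger than the LLC */
        int    reps = 16;
        char  *src  = malloc(size);
        char  *dst  = malloc(size);
        if (!src || !dst) return 1;
        memset(src, 1, size);               /* touch the pages first       */
        memset(dst, 0, size);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++)
            memcpy(dst, src, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.2f GB/sec\n", (double)size * reps / sec / 1e9);
        free(src); free(dst);
        return 0;
    }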


    The bulk of time in CRAM decoding is spent in logic like:
    tab[0]=colorA;
    tab[1]=colorB;
    px0=tab[(pix>>0)&1]; px1=tab[(pix>>1)&1];
    px2=tab[(pix>>2)&1]; px3=tab[(pix>>3)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
    ct+=stride;
    px0=tab[(pix>>4)&1]; px1=tab[(pix>>5)&1];
    px2=tab[(pix>>6)&1]; px3=tab[(pix>>7)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
    ct+=stride;
    px0=tab[(pix>> 8)&1]; px1=tab[(pix>> 9)&1];
    px2=tab[(pix>>10)&1]; px3=tab[(pix>>11)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
    ct+=stride;
    px0=tab[(pix>>12)&1]; px1=tab[(pix>>13)&1];
    px2=tab[(pix>>14)&1]; px3=tab[(pix>>15)&1];
    ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
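    For context, a hedged sketch of how that inner logic fits into decoding one
    4x4 two-color block (MS Video-1 style); the function name, pixel type, and
    calling convention here are assumptions, not the actual decoder:

    #include <stdint.h>

    /* Decode one 4x4 block with a 16-bit pixel mask selecting between two
       colors; 'stride' is the destination row pitch in pixels. */
    static void cram_decode_block_2color(uint16_t *ct, int stride,
                                         uint16_t pix,
                                         uint16_t colorA, uint16_t colorB)
    {
        uint16_t tab[2] = { colorA, colorB };
        for (int row = 0; row < 4; row++) {
            ct[0] = tab[(pix >> (row * 4 + 0)) & 1];
            ct[1] = tab[(pix >> (row * 4 + 1)) & 1];
            ct[2] = tab[(pix >> (row * 4 + 2)) & 1];
            ct[3] = tab[(pix >> (row * 4 + 3)) & 1];
            ct += stride;
        }
    }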


    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Sep 21 16:20:00 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> wrote:
    On 9/20/2025 8:10 AM, Waldek Hebisch wrote:
    BGB <cr88192@gmail.com> wrote:
    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips have 150W TDP, either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
    configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running as a 105W TDP, I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.

    Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks to 4453.45 DMIPS (Dhrystone MIPS). Single core in my desktop benchmarks to about 50000 DMIPS. Dhrystone contains string operations which benefit
    from SSE/AVX, but I would expect that on media load speed ratio would
    be even more favourable to desktop core. On jumpy code ratio is probably
    lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

    It is hard to compare performance per watt: Orange Pi Zero 3 has low
    power draw (of order 100 mA from 5V USB charger with one core active) and
    it is not clear how it is distributed between CPUs and Ethernet interface. RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
    normally seem to run at a fraction of rated power too (but I have
    no way to directly measure CPU power draw).

    Of course, there is a catch: desktop CPU is made on more advanced
    process than small processors. So it is hard to separate effects
    from architecture and from the process.


    I had noted before that when I compiled Dhrystone on my Ryzen using
    MSVC, it is around 10M, or 5691 DMIPs, or around 1.53 DMIPs/MHz.

    Curiously, the score is around 4x higher (around 40M) if Dhrystone is compiled with GCC (and around 2.5x with Clang).

    For most other things, the performance scores seem closer.

    I don't really trust GCC's and Clang's Dhrystone scores as they seem basically out-of-line with most other things I can measure.

    I would not totally dismiss Dhrystone scores. Apparently Dhrystone
    allows more optimizations than other programs. There may be bias,
    because GCC and Clang developers select optimizations to improve
    benchmark scores. But AFAICS the compiled code performs the work it should
    do. And the work corresponds to a typical work mix from the past.
    More importantly, optimizations in gcc are mostly independent of
    architecture, so essentially the same optimizations are applied
    on all machines.

    BTW: I get similar Dhrystone results from GCC and Clang (differences of
    a few percent or less).

    Concerning other loads, my current desktop (12 cores) builds a medium-size
    program about 8.5 times faster than a 4-core Core 2 from 2008. There
    is a non-negligible serial part in the build, so a single modern core is
    about 3 times faster than a single core in the Core 2. I do not have
    comparable results for the 64-bit Orange Pi, but on slow machines I see
    build times that are 40 times longer. A big part of that is the number of
    cores; hyperthreading helps too (real time using 20 jobs is significantly
    smaller than real time using 12 jobs). But clearly a single big core is
    significantly faster than the smaller cores.

    Part of the advantage of the big core is due to big caches; my understanding
    is that the smaller processors that I use have much smaller caches.

    Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
    If assuming MSVC as the reference, this would imply (after normalizing
    for clock-speeds) that the Ryzen only gets around 50% more IPC.



    I noted when compiling my BJX2 emulator:
    My Ryzen can emulate it at roughly 70MHz;
    My cell-phone can manage it at roughly 30MHz.

    This isn't *that* much larger than the difference in CPU clock speeds.


    It is like, I seemingly live in a world where a lot of my own benchmark attempts tend to be largely correlated with the relative different in
    clock speeds and similar.

    Well, clock speed is a major factor for power efficiency. Running a CPU
    at a lower clock frequency significantly lowers energy per instruction.
    And the mere capability to run at a high clock frequency causes increased
    power use at lower clock frequencies (IIUC high frequency may need
    bigger transistors and/or more transistors).

    Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
    to perform somewhat below that curve.


    Though, in the case of the laptop, it may be a case of not getting all
    that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
    speeds, and on that laptop, its memcpy speeds are kinda crap if compared with CPU clock-speed).

    Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
    Thing ran Quake and Quake 2 pretty OK, but not much else.


    Though, if running the my emulator on the laptop, it is more back on the curve of relative clock-speed, rather than on the
    relative-memory-bandwidth curve.

    It seems both my neural-net stuff and most of my data compression stuff, more follow the memory bandwidth curve (though, for the laptop, it seems
    NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).


    Well, and then my BJX2 core seems to punch slightly outside its weight
    class (MHz wise) by having disproportionately high memory bandwidth.

    ...


    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Mon Sep 22 03:21:17 2025
    From Newsgroup: comp.arch

    In article <bp4jck19kcmq4i571fiofcrk1k6nn9k0ha@4ax.com>,
    George Neuner <gneuner2@comcast.net> wrote:
    On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <quadibloc@invalid.invalid> wrote:

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard

    Sounds more like dynamic micro-threading.

    Over the years I've seen a handful of papers about compile time micro-threading: that is the compiler itself identifies separable
    dependency chains in serial code and rewrites them into deliberate
    threaded code to be executed simultaneously.

    It is not easy to do under the best of circumstances and I've never
    seen anything about doing it dynamically at run time.

    To make a thread worth rehosting to another core, it would need to be
    (at least) many 10s of instructions in length. To figure this out dynamically at run time, it seems like you'd need the decode window to
    be 1000s of instructions and a LOT of "figure-it-out" circuitry.


    MMV, but to me it doesn't seem worth the effort.

    I began reading the patent, and it's not clear to me this approach is
    going to be much of an improvement. A great deal of analysis magic has
    to happen to find code to spread across the cores. To summarize, it's basically taking code that looks like:

    for(i = 0; i < N; i++) {
        // Do some work
    }

    for(i = 0; i < M; i++) {
        // Do some different work
    }

    and have two cores run the loops at the same time, with some special
    check hardware to make sure they really are independent (I gave up before
    really figuring out what they're going to do, patents are not fun to read).
    I think they actually want to divide up each loop into sections, and do
    them in parallel. If someone wanted to explain in better detail what
    they are doing, I'd like to read that short summary in non-patentese.
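    For illustration, a minimal software analogue of that split using C11
    threads; this only shows the intended parallelism, not Intel's actual
    mechanism or the dependence checking described in the patent, and the
    array sizes and work bodies are placeholders:

    #include <stdio.h>
    #include <threads.h>

    #define N 1000000
    #define M 1000000

    static double a[N], b[M];

    static int work_a(void *arg)              /* "Do some work"             */
    {
        (void)arg;
        for (int i = 0; i < N; i++)
            a[i] = i * 0.5;
        return 0;
    }

    int main(void)
    {
        thrd_t t;
        thrd_create(&t, work_a, NULL);        /* first loop on another core */
        for (int i = 0; i < M; i++)           /* "Do some different work"   */
            b[i] = i * 2.0;
        thrd_join(t, NULL);
        printf("%f %f\n", a[N - 1], b[M - 1]);
        return 0;
    }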

    A trivial alternative approach to shrinking core size while not losing
    single thread speed is to basically make all cores Narrow (meaning
    support something like 4 instructions wide), and when code needs more,
    stall the neighboring core and steal its functional units to form a new
    8-wide core. This approaches the SMT hardware sharing in a different direction, and so code without much instruction parallelism will run
    better on two smaller cores than on a big core with two threads, but if
    a single thread can use 8-wide instruction execution, it can steal it from
    the neighboring core for a while.

    If that's too much trouble, then for x86, all cores have just AVX-256 width, and take two clocks to do each AVX-512 operation (which is still better than just AVX-256). But hardware can join the neighboring cores together to be AVX-512, with each AVX-512 op taking just one clock now (and this can just
    be AVX, the other core can run other instructions unimpeded).

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Sep 22 11:28:13 2025
    From Newsgroup: comp.arch

    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Sep 22 20:28:33 2025
    From Newsgroup: comp.arch

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more efficient
    at general integer work and other common actions, as a result of a
    better designed instruction set and register set. But once you are
    using slightly more specific hardware features - vector processing,
    floating point, acceleration for cryptography, etc., it's all much the
    same. It takes roughly the same energy to do these things regardless of
    the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
    on a chip.

    So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go
    smaller - small embedded systems - x86 is totally non-existent because
    an x86 microcontroller would be an order of magnitude bigger, more
    expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Sep 22 19:36:05 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more efficient
    at general integer work and other common actions, as a result of a
    better designed instruction set and register set. But once you are
    using slightly more specific hardware features - vector processing,
    floating point, acceleration for cryptography, etc., it's all much the
    same. It takes roughly the same energy to do these things regardless of
    the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
    on a chip.

    So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go smaller - small embedded systems - x86 is totally non-existent because
    an x86 microcontroller would be an order of magnitude bigger, more
    expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs, and
    NICs} and DRAM {plus power supplies and cooling} as in the cores.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Sep 23 08:24:54 2025
    From Newsgroup: comp.arch

    On 22/09/2025 21:36, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most of them
    run software that's just as easily available for ARM. And many
    datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more efficient
    at general integer work and other common actions, as a result of a
    better designed instruction set and register set. But once you are
    using slightly more specific hardware features - vector processing,
    floating point, acceleration for cryptography, etc., it's all much the
    same. It takes roughly the same energy to do these things regardless of
    the instruction set. Cache memory takes about the same power, as do PCI
    interfaces, memory interfaces, and everything else that takes up power
    on a chip.

    So when you have a relatively small device - such as what you need for a
    mobile phone - the instruction set and architecture makes a significant
    difference and ARM is a lot more power-efficient than x86. (If you go
    smaller - small embedded systems - x86 is totally non-existent because
    an x86 microcontroller would be an order of magnitude bigger, more
    expensive and power-consuming than an ARM core.) But when you have big
    processors for servers, and are using a significant fraction of the
    processor's computing power, the details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs, and NICs} and DRAM {plus power supplies and cooling} than in the cores.

    Yes, all that will be independent of the type of cpu core.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 21:08:10 2025
    From Newsgroup: comp.arch

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power
    when emulating x86 than a typical Intel or AMD CPU, even if
    slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most
    of them run software that's just as easily available for ARM.
    And many datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM
    nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of
    calculation you are doing. ARM cores can often be a lot more
    efficient at general integer work and other common actions, as a
    result of a better designed instruction set and register set. But
    once you are using slightly more specific hardware features -
    vector processing, floating point, acceleration for cryptography,
    etc., it's all much the same. It takes roughly the same energy to
    do these things regardless of the instruction set. Cache memory
    takes about the same power, as do PCI interfaces, memory
    interfaces, and everything else that takes up power on a chip.

    So when you have a relatively small device - such as what you need
    for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
    x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of
    magnitude bigger, more expensive and power-consuming than an ARM
    core.) But when you have big processors for servers, and are using
    a significant fraction of the processor's computing power, the
    details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.
    Spec.org has a special benchmark for that called SPECpower_ssj2008.
    It is old and Java-oriented but I don't think that it is useless.

    Right now the benchmark clearly shows that AMD offerings dominate
    Intel's.
    The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html


    The best Intel scores are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores;
    they barely beat some old EPYC3 scores from 2021. https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
    https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html


    There are very few non-x86 submissions. The only one that I found in the
    last 5 years was using the Nvidia Grace CPU Superchip based on Arm Inc.
    Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Wed Sep 24 15:56:37 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Sep 24 20:00:07 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/09/2025 17:28, Stefan Monnier wrote:
    But, AFAIK the ARM cores tend to use significantly less power
    when emulating x86 than a typical Intel or AMD CPU, even if
    slower.

    AFAIK datacenters still use a lot of x86 CPUs, even though most
    of them run software that's just as easily available for ARM.
    And many datacenters care more about "perf per watt" than raw performance.

    So, I think the difference in power consumption does not favor ARM nearly as significantly as you think.


    Yes, I think that is correct.

    A lot of it, as far as I have read, comes down to the type of calculation you are doing. ARM cores can often be a lot more
    efficient at general integer work and other common actions, as a
    result of a better designed instruction set and register set. But
    once you are using slightly more specific hardware features -
    vector processing, floating point, acceleration for cryptography,
    etc., it's all much the same. It takes roughly the same energy to
    do these things regardless of the instruction set. Cache memory
    takes about the same power, as do PCI interfaces, memory
    interfaces, and everything else that takes up power on a chip.

    So when you have a relatively small device - such as what you need
    for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
    x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of magnitude bigger, more expensive and power-consuming than an ARM
    core.) But when you have big processors for servers, and are using
    a significant fraction of the processor's computing power, the
    details of the core matter a lot less.

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.
    Spec.org has special benchmark for that called SPECpower_ssj 2008.
    It is old and java-oriented but I don't think that it is useless.

    Right now the benchmark clearly shows that AMD offferings dominate
    Intel's.
    The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html


    The best Intel score are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores,
    They barely beats some old EPYC3 scores from 2021. https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
    https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html


    There are very few non-x86 submissions. The only one that I found in
    last 5 years was using Nvidia Grace CPU Superchip based on Arm Inc.
    Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html


    A quick survey of the result database indicates only Oracle is
    sending results to the database.

    Would be interesting to see the Apple/ARM comparisons.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 23:37:17 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 20:00:07 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:


    A quick survey of the result database indicates only Oracle is
    sending results to the data base.


    You misread it.
    The organization that submits a result is listed as "Test Sponsor".
    Oracle is the sponsor of none of the results that I listed in my previous
    post.
    The sponsors are ASUSTeK Computer Inc, New H3C Technologies Co, Lenovo
    Global Technology and Infobell IT Solutions Pvt.

    The most recent submissions are by Dell and Lenovo. https://www.spec.org/power_ssj2008/results/res2025q3/

    Would be interesting to see the Apple/ARM comparisons.

    Would be very interesting, but not going to happen.
    The last time Apple submitted something to spec.org was almost 20 years ago.
    And it never submitted to Spec Power SSJ, which sort of makes sense -
    this is a benchmark designed for servers and Apple does not sell servers.

    The ARM architecture vendor with the highest number of submissions to
    spec.org is Ampere, but they abandoned Arm-designed cores a couple of
    years ago and are now shipping Arm architecture CPUs with cores of their
    own design.
    However there are a few results in the database that use their previous
    offerings based on Arm Neoverse-N1 cores. Here is the best result: https://www.spec.org/power_ssj2008/results/res2024q1/power_ssj2008-20231104-01332.html




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 23:48:50 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 15:56:37 -0400
    George Neuner <gneuner2@comcast.net> wrote:

    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs,
    SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
    the cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    I think that it's less than 80%. But it does not matter and does not
    change anything - power spent for cooling is approximately
    proportional to power spent for running.

    At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).


    I don't think that you have a scientific study to support your claims.

    That's before I state the obvious - even if you were correct about
    main RAM consuming more power than the CPU (which I doubt very much), still different CPUs can perform the same job with a very different number of
    main RAM accesses.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 24 21:04:03 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 24 Sep 2025 15:56:37 -0400
    George Neuner <gneuner2@comcast.net> wrote:

    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs,
    SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
    the cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    I think that it's less than 80%. But it does not matter and does not
    change anything - power spent for coooling is approximately
    proportional to power spent for runninng.

    At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    A typical 16GB DIMM module will dissipate 3-5 watts. So 128GB will
    draw in the vicinity of 32 watts. The TDP for a high-end
    Xeon may exceed 350 watts; Diamond Rapids may exceed 500 watts.
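    A quick arithmetic sketch in Python, treating the figures above (3-5 W
    per 16GB DIMM, a 350 W CPU TDP) as assumptions, just to put the two in
    proportion:

        # back-of-envelope: DRAM power vs one high-end CPU
        dimm_w_low, dimm_w_high = 3.0, 5.0
        dimms = 128 // 16                      # eight 16GB modules
        ram_low, ram_high = dimms * dimm_w_low, dimms * dimm_w_high
        cpu_tdp = 350.0                        # high-end Xeon class
        print(f"DRAM: {ram_low:.0f}-{ram_high:.0f} W vs CPU TDP {cpu_tdp:.0f} W")
        print(f"DRAM worst-case share of DRAM+CPU: {ram_high/(ram_high+cpu_tdp):.0%}")
        # -> DRAM: 24-40 W vs CPU TDP 350 W; roughly 10% worst case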
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 25 00:21:02 2025
    From Newsgroup: comp.arch

    On Wed, 24 Sep 2025 21:04:03 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    <...>

    Scott,
    When you answer George Neuner's point, can you, please, reply to George Neuner's post rather than to mine?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Sep 24 21:27:09 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 24 Sep 2025 15:56:37 -0400
    George Neuner <gneuner2@comcast.net> wrote:
    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    In the old days, I heard that they used about as much power for
    cooling as goes into the machines. In recent times, I have heard
    about success stories where they use less. <https://en.wikipedia.org/wiki/Coefficient_of_performance> says: "Most
    air conditioners have a COP of 3.5 to 5", i.e., quite a bit less
    energy is expended on cooling than is moved away.
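    A minimal sketch in Python of what a given COP implies for the cooling
    share, assuming essentially all IT power ends up as heat that has to be
    moved out:

        # COP = heat moved / electrical work done by the cooling plant
        def cooling_share(cop):
            it_power = 1.0                  # normalize the IT load to 1
            cooling_power = it_power / cop  # work needed to move that heat
            return cooling_power / (it_power + cooling_power)

        for cop in (3.5, 5.0):
            print(f"COP {cop}: cooling is {cooling_share(cop):.0%} of total power")
        # -> about 22% at COP 3.5 and 17% at COP 5 - nowhere near 80%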

    At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    Where do you get this from?

    A typical 16GB dimm module will dissipate 3-5 watts. So 128GB will
    draw in the vincinity of 32 watts.

    We have several machines with 128GB RAM. They idle at around 40W, and
    a box with less RAM and otherwise the same hardware does not idle at
    much lower power consumption. The RAM has no active cooler, no
    passive cooler, and the modules sit close to each other, so they cannot
    dissipate lots of power, certainly not 32W.

    By contrast, the CPUs on these machines have elaborate active cooling solutions, and consume 105W TDP (142W power limit).

    SSDs are also unlikely to be consuming a lot of power, given the kind
    of cooling that they get. Yes, there are elaborate coolers for
    M.2-format SSDs, but that is not the kind of format that the bigger
    servers use (which rather use U.2 or U.3 SSDs), and even with M.2,
    there is usually no need to use SSD cooling.

    Maybe if you have a huge number of SSDs, power consumption may rival
    that of the CPU.

    - antn
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Sep 24 18:38:06 2025
    From Newsgroup: comp.arch

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them.

    Is it really *that* inefficient? Sounds even more horrible than what
    I'd expect. Do you have some reference?

    At the same time, most of the heat generated by typical systems is due
    to the RAM - not the CPU(s).

    Even if we consider "CPUs", their power consumption can go much further
    than just that of the cores. I remember reading about Threadripper
    spending about half its power in its interconnect.
    Still, I suspect you need a lot of RAM before it starts consuming more
    power than your CPUs (at least the kind of RAM you find in gaming
    desktops consumes significantly less than the CPU, last I checked), so it
    likely depends on the workloads that are targeted.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 14:23:04 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 24 Sep 2025 21:04:03 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    <...>

    Scott,
    When you answer George Neuner's point, can you, please, reply to George Neuner's post rather than to mine?


    The attributions are there, as are the appropriate indentation markers ('>').

    Once I've read an article and restarted my newsreader, I don't have access
    to read articles (at least not easily).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 25 17:49:13 2025
    From Newsgroup: comp.arch

    On Thu, 25 Sep 2025 14:23:04 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Once I've read an article and restarted my newsreader, I don't have
    access to read articles (at least not easily).

    Doesn't it suck?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (M. Anton Ertl) to comp.arch on Thu Sep 25 15:28:56 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Once I've read an article and restarted my newsreader, I don't have access
    to read articles (at least not easily).

    I press the "Goto parent" button, and I think that already existed in
    xrn-9.03, which you use; maybe you need to configure it, or use the
    shortcut if one exists. The only problem is that if the parent is
    read, but an ancestor article is unread, it will skip the parent and
    go to that ancestor. If I ever find the time, I will fix that and
    send a patch to Jonathan Kamens.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 15:37:49 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 25 Sep 2025 14:23:04 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Once I've read an article and restarted my newsreader, I don't have
    access to read articles (at least not easily).

    Doesn't it suck?

    Not really. I've been using the same client since 1989; I'm used to it.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 15:41:30 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (M. Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Once I've read an article and restarted my newsreader, I don't have access to read articles (at least not easily).

    I press the "Goto parent" button, and I think that already existed in >xrn-9.03,

    yes, it has always existed, and yes, I can use it, but it is quite
    slow over NNTP. As the quoting is always accurate,
    I generally don't feel it is necessary in the case that Michael
    complained about.

    I can also hand-edit ~/.newsrc to see older articles, but seldom
    have the need.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Sep 25 23:16:00 2025
    From Newsgroup: comp.arch

    George Neuner wrote:
    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat
    generated by typical systems is due to the RAM - not the CPU(s).

    I am quite sure that number is simply bogus: The power factors we were
    quoted when building the largest new datacenter in Norway 10+ years ago
    were more like 6-10% of total power for cooling, AFAIR.

    .. a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.
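    A minimal sketch in Python of what that figure means, since PUE is
    defined as total facility power divided by IT equipment power:

        # overhead (cooling, power conversion, lighting) as a share of total
        def overhead_share(pue):
            return (pue - 1.0) / pue

        print(f"PUE 1.07 -> overhead is {overhead_share(1.07):.1%} of total power")
        print(f"PUE 2.00 -> overhead is {overhead_share(2.00):.1%} of total power")
        # -> about 6.5% of the total at PUE 1.07, vs 50% at an old PUE of 2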

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 25 23:48:19 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    George Neuner wrote:
    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs, SSDs,
    and NICs} and DRAM {plus power supplies and cooling} than in the
    cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).

    I am quite sure that number is simply bogus: The power factors we were quoted when building the largest new datacenter in Norway 10+ years ago,
    was more like 6-10% of total power for cooling afair.

    . a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    All of this depends on where the "cold sink" is!! and how cold it is.

    Pumping 6°C sea water through water-to-air heat exchangers is a lot
    more power efficient than using Freon and dumping the heat into 37°C
    air.

    I still suspect that rectifying and delivering clean (low-noise) DC
    to the chassis takes a lot more energy than taking the resulting heat
    away.

    Flash will have low heat signature
    DRAM will have significant heat signature
    DISKs will have significant heat signature
    GPUs will have significant heat signature
    CPUs will have significant heat signature
    Motherboard has low-medium heat signature


    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 02:03:21 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:


    I am quite sure that number is simply bogus: The power factors we were
    quoted when building the largest new datacenter in Norway 10+ years ago,
    was more like 6-10% of total power for cooling afair.

    . a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    All of this depends on where the "cold sink" is !! and how cold it is.

    Pumping 6ºC sea water through water to air heat exchangers is a lot
    more power efficient than using FREON and dumping the heat into 37ºC
    air.

    I still suspect that rectifying and delivering clean (low noise) D/C
    to the chassis' takes a lot more energy that taking the resulting heat
    away.

    The FB article above describes how they reduced the
    losses due to voltage changes as well as rectification.

    Consider that there are losses converting from the
    primary (e.g. 22kV) to 480V (2%), and additional losses
    converting to 208V (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.
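    Chaining those stage losses together, treating the quoted percentages
    as assumptions, gives a feel for the totals (a quick Python sketch):

        # multiply stage efficiencies, report the combined loss
        def total_loss(stage_losses):
            eff = 1.0
            for loss in stage_losses:
                eff *= (1.0 - loss)
            return 1.0 - eff

        best  = total_loss([0.02, 0.03, 0.06])   # 2% + 3% + 6% rectification
        worst = total_loss([0.02, 0.03, 0.12])   # 2% + 3% + 12% rectification
        print(f"conventional chain: {best:.1%} to {worst:.1%} lost")
        # -> roughly 10.6% to 16.3%, versus the ~7.5% optimized figure above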

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 25 23:30:27 2025
    From Newsgroup: comp.arch

    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:


    I am quite sure that number is simply bogus: The power factors we were
    quoted when building the largest new datacenter in Norway 10+ years ago, was more like 6-10% of total power for cooling afair.

    . a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    All of this depends on where the "cold sink" is !! and how cold it is.

    Pumping 6ºC sea water through water to air heat exchangers is a lot
    more power efficient than using FREON and dumping the heat into 37ºC
    air.

    I still suspect that rectifying and delivering clean (low noise) D/C
    to the chassis' takes a lot more energy that taking the resulting heat
    away.

    The FB article above describes how they reduced the
    losses due to voltage changes as well as rectification.

    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.


    What if, as opposed to each computer using its own power supply (from 120
    or 240 VAC), it used a buck converter, say, 960VDC -> 12VDC?

    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second-stage drop could use slightly cheaper transistors, while
    still limiting electrical losses due to wire resistance and still
    avoiding losses due to transformers and rectifiers.

    To balance cost and efficiency, could use, say, 8 or 10AWG CCA (copper
    clad aluminum) vs 10 or 12AWG copper. Could run the wires at a
    relatively lower amperage rating, say:
    8A over 10AWG CCA
    16A over 8AWG CCA
    Or, roughly 1/3 nominal.


    Where, CCA wire is a lot cheaper than copper wire, so it is easier to
    justify using absurdly thick wire here.

    Contrast that with, say, running 8A over 20AWG, which works, but a fair
    bit more is lost as heat. The alternative could be to run the
    power over parallel thinner wires rather than a single thicker wire. For
    example, replacing each 10AWG wire with four 14AWG wires.

    8A at 192V being 1.5kW, and 8A at 960V being 7.7kW.
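    A rough Python sketch of the resistive-loss side of this, using assumed
    values (about 3.3 milliohm per metre for 10AWG copper, CCA taken as
    roughly 1.5x the resistance of copper, and a 20 m one-way run):

        R_10AWG_CU = 0.00328           # ohm/m, 10AWG copper (approx.)
        CCA_FACTOR = 1.5               # assumed resistance penalty for CCA
        run_m, amps = 20.0, 8.0        # assumed run length and current

        r_loop = 2 * run_m * R_10AWG_CU * CCA_FACTOR   # out and back
        p_loss = amps ** 2 * r_loop                    # I^2 * R
        for volts in (192.0, 960.0):
            p_load = amps * volts
            print(f"{volts:.0f}V: {p_load/1000:.1f}kW delivered, "
                  f"{p_loss:.1f}W lost in the wire ({p_loss/p_load:.2%})")
        # same current, same wire -> same watts lost; the higher voltage just
        # makes that loss a smaller fraction of the power delivered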


    Though, assuming a series of 16 racks running on each shared 960V bus,
    this would be 128A. The above de-rating scheme would likely make normal
    CCA wire impractical. Probably could distribute DC power over a pair of
    1.25" aluminum bars or 0.75" to 1.0" copper bars. Likely, the 1.25"
    aluminum bar being the cheaper option here.

    Could maybe then connect each 10AWG wire to the bars using a clamp,
    and/or use an intermediate socket or modular connector.

    Does kinda seem a bit overkill though.


    Main power distribution would likely need to operate at a higher
    voltage, otherwise the building-scale power rails would be absurd here.

    Say, if one assumes a monolithic 960VDC system, and 16 rows, this is
    2048A. Like, what does one do here, 3" copper or 5" aluminum rails?... Probably no.

    Well, or maybe get creative and use large aluminum I-beams that serve
    both as power distribution and joists (so, all this metal can serve
    additional purpose). Though, 960V through the joists seems like a
    building maintenance hazard. Say, for example, 0V through the
    floor and 960V through the ceiling.


    Input power would likely need multiple transformers and rectifiers to be practical; though admittedly I have little idea here what sorts of
    diodes would be used in these rectifiers. Seems like each diode would
    itself need to be stupidly large to deal with this crap.


    As for cooling, could maybe either use liquid cooling, or hybrid
    air/liquid (say, with superchilled liquid pumped through radiators, and
    then fans circulating air through these radiators).

    To move lots of heat, could maybe use -90C ethanol as a coolant. Where
    ethanol can be pumped like water, but could be nearly as cold as Freon.
    Would likely still need big refrigeration pumps.

    If one could have an artificial lake outside (preferably with a
    sun-blocking cover), this could be used as a heat-sink.

    Where, say:
    Inner loop uses cold ethanol;
    Refrigeration system moves heat from ethanol loop to a water loop;
    The water loop pumps to/from an artificial lake used as a heat sink.
    If the lake is above ambient, it will dissipate heat, but if too much
    higher it would suffer evaporation losses.

    One idea here could be to have 2 levels of cover over the lake:
    The lower one is a metal cover painted black on both sides, placed
    roughly 20 inches over the surface of the water;
    The second cover is another 20 inches higher, painted black on the lower
    side and white on the upper side;
    The lower cover has a blocking wall to limit how much water vapor
    escapes, whereas the upper barrier is open to the sides (allowing air to
    flow through).

    As the water evaporates, it moves heat into the barrier, which then
    radiates heat (as black-body radiation) where the water condenses and
    falls back into the lake;
    The upper barrier partly absorbs heat from the lower layer, and also
    serves to reflect the sun. Air-flow between the layers can be used to
    radiate heat.

    One other possibility being to have a tall tapered tube (narrower near
    the top) with an open top, with the coolant water in the bottom (with
    the tube serving to reduce evaporation loss, as water is more
    likely to re-condense on the walls and fall back down than to escape the
    top). Could likely be made out of steel or similar, maybe black inside,
    white outside. Then maybe could heat the coolant water to around 70 or 80C.

    While in theory, a giant radiator could work, a sufficiently large
    radiator would likely be impractically expensive.


    Well, don't know what people actually do, this is just what comes to
    mind at the moment.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 26 14:02:31 2025
    From Newsgroup: comp.arch

    On Thu, 25 Sep 2025 23:16:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    George Neuner wrote:
    On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 22 Sep 2025 19:36:05 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Big servers have rather equal power in the peripherals {DISKs,
    SSDs, and NICs} and DRAM {plus power supplies and cooling} than
    in the cores.


    Still, CPU power often matters.

    Yes ... and no.

    80+% of the power used by datacenters is devoted to cooling the
    computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).

    I am quite sure that number is simply bogus: The power factors we
    were quoted when building the largest new datacenter in Norway 10+
    years ago, was more like 6-10% of total power for cooling afair.

    .. a quick google...

    https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

    This one claims a 1.07 Power Usage Effectiveness.

    Terje


    I think 1.07 is for 480VAC outside the data center building to 48VDC at
    the server power plug.
    It does not include losses within the server:
    - 48V to mostly 12V by the server's PSU
    - 12V to the whole zoo of low voltages by on-board DC-DC converters.






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 26 12:10:41 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,

    Transistors?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 26 16:32:59 2025
    From Newsgroup: comp.arch

    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were the most widespread motors by far up to 25-30 years ago. But my
    impression was that today various types of electric motors (DC, esp.
    brushless, AC sync, AC async) enjoy similar popularity.

    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    I was never in a big datacenter, but I have heard that they prefer DC.


    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC converters are made of FETs. FETs are
    transistors.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 14:28:02 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:



    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.


    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Fri Sep 26 07:37:59 2025
    From Newsgroup: comp.arch

    On 9/26/25 7:28 AM, Scott Lurndal wrote:

    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).


    Is it still -48V?
    Historically, Bell System plant voltage, supplied by batteries.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 15:07:40 2025
    From Newsgroup: comp.arch

    Al Kossow <aek@bitsavers.org> writes:
    On 9/26/25 7:28 AM, Scott Lurndal wrote:

    In those datacenters, the UPS distributes 48VDC to the rack components
    (computers, network switches, storage devices, etc).


    Is it still -48V?
    Historically, Bell System plant voltage, supplied by batteries.

    Yes. Using a positive-ground system reduced corrosion in buried
    cabling. While corrosion is not generally an issue for datacenters,
    they use the same PDUs that the telecom industry uses.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 12:58:43 2025
    From Newsgroup: comp.arch

    On 9/26/2025 9:28 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:



    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.


    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

    OK.

    I had thought they were usually 120VAC or 240VAC.

    At least, what rack-servers I had encountered were usually one of these (sometimes they had the little switch on the power-supply set to 240V
    even in the US).

    Then again, can also note that when setting up my milling machine,
    lathe, and plasma table, these were all using 240VAC for the power
    distribution to the various components. These were all Tormach machines
    though, so can't say for others.




    48VDC also makes sense, as it is common in other contexts. I sorta
    figured a higher voltage would have been used to reduce the wire
    thickness needed.

    Though, I don't actually know how real datacenters work here, just sort
    of coming up with something optimized for the target goals
    (powering all this stuff while minimizing electrical losses and cost).




    I did realize after posting that, if the main power rails were organized
    as a grid, the whole building could be done probably with 1.25" aluminum
    bars.

    Could power the grid of bars at each of the 4 corners, with maybe some
    central diagonal bars (which cross and intersect with the central part
    of the grid, and an additional square around the perimeter). Each corner supply could drive 512A, and with this layout, no bar or segment should
    exceed 128A.



    Assuming they were using 240VAC, seems like the typical housing setup
    (12AWG wire) would be woefully insufficient. Would either need to be
    heavily built up and/or use much heavier gauge wiring.

    Or also solid copper or aluminum bars. Not sure if I had heard of this,
    usual idea IIRC was that people always use wire for AC power, except
    that if pushing a continuous load of several hundred amps, wire seems
    less practical (would need to be very thick, hard to work with, and expensive).


    Granted, more likely they would run the cable closer to the rated values
    and accept more energy loss due to electrical resistance (since, yeah, a
    1.25" bar or similar for 128A is a little excessive).


    Though, it seems likely that in this case, solid metal bars might be
    cheaper than using a whole lot of heavy gauge wire. And, repurposing
    generic aluminum bar-stock might be the cheapest option here (with joins either as aluminum clamps or via welding).


    If operating closer to conventional electrical ratings, could drop to
    0.375" bars for 128A. Going much thinner, voltage drops and heat would
    become an issue.

    So, say:
    0.250" likely high resistive loss.
    0.375" roughly nominal.
    0.750" maybe sufficiently low resistance
    (could likely handle 500A before significant heat)
    1.250" maybe overkill


    Well, and could maybe put a plastic coating or similar on the bars to
    limit accidental short-circuits. Decided to leave out analysis, but the
    most likely option (to balance cost and effectiveness) would likely be a post-install application of acrylic paint (latex paint would be
    insufficient, epoxy likely too expensive, ...).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 15:23:38 2025
    From Newsgroup: comp.arch

    On 9/26/2025 8:32 AM, Michael S wrote:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread motors by far up to 25-30 years ago. But my imressioon was that today various type of electric motors (DC, esp. brushlees, AC sync, AC async) enjoy similar popularity.



    IIRC, reluctance motors are also popular here. They are sorta like BLDC,
    but cheaper due to not needing big magnets (though, BLDC motors can give
    more power in a physically smaller package if compared with reluctance
    motors; but reluctance motors are still more compact if compared with AC induction motors).


    Like BLDC, it is possible to run reluctance motors at an exact speed.

    This is unlike AC induction motors where, although speed can be adjusted
    with a VFD, it isn't particularly exact as it depends on the load on the
    motor and similar. Accurate speed control on an induction motor will
    still require using an encoder, but they are still not good for
    positional control (and the effective "holding torque" of an AC
    induction motor is very low).

    Where more accuracy is needed, something like a big BLDC or reluctance
    motor with a servo-drive might be used (typically with hall-effect
    sensors in the stator).

    Generally, these motors can't be driven open-loop, as they are prone to
    "drop out" at relatively little load in these cases.


    Technically, the stator construction for a reluctance motor can be
    nearly identical to an induction motor, the main differences are in the
    design of the rotor.

    Where, say, an induction motor typically has a hollow rotor consisting
    of layered steel plates with an embedded copper or aluminum "squirrel
    cage" (a ring of bars around the perimeter, all shorted together at the
    top and bottom).

    The reluctance motor can use a solid steel rotor, with gaps machined in
    to control where magnetic flux will go.

    A typical BLDC motor either has a ring of permanent magnets, or
    alternating poles (from the top/bottom) with a central ring magnet.



    I had before imagined it should be possible to make a hybrid of a
    reluctance and induction rotor for intermediate effects; partly by
    filling the gaps in the reluctance rotor with aluminum in place of air.
    This could still operate synchronously, but could have better torque
    under load and less issue with drop out. If it drops below synchronous
    speed, it would instead induce eddy currents in the aluminum parts of
    the rotor; rather than the air being "basically useless". However,
    aluminum would still behave more like air as far as the magnetic flux
    lines are concerned.


    Though, some commercial designs had instead gone the other way,
    hybridizing the reluctance rotor with a BLDC rotor, and using (cheaper) ceramic magnets in place of rare-earth magnets (as typical in a BLDC).

    One variant here resembling a reluctance motor with a split rotor, with
    the top/bottom rotated relative to each other, and a central ceramic
    ring magnet. Though, I think this pushes it more into the BLDC category.



    Also common, on the AC side, are 440 and 208 3-phase.
    Many traditional AC induction motors operate on 440VAC 3-phase.
    A lot of traditional industrial machines were also 440VAC.


    There is some stuff I saw about electrostatic motors gaining popularity
    in some areas, but these tend to operate at high voltages but very
    little amperage. They are comparably weak compared with magnetic motors,
    but can be more energy efficient.


    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    I never was in big datacenter, but heard that they prefer DC.


    DC -> DC allows higher conversion efficiency compared to AC.
    Higher voltage distribution also allows more efficiency.


    Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    A higher line frequency would increase the relative efficiency of
    electrical transformers. Higher voltage AC also has a higher conversion efficiency than lower voltage.
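    A minimal Python sketch of the size side of that trade-off, using the
    standard transformer EMF relation V_rms = 4.44 * f * N * A_core * B_max
    (the numbers below are illustrative assumptions, not a real design):

        # for a fixed voltage, turns count and flux density, the required
        # core cross-section scales as 1/f
        def core_area_m2(v_rms, freq_hz, turns, b_max):
            return v_rms / (4.44 * freq_hz * turns * b_max)

        V, N, B = 480.0, 100, 1.5
        for f in (60.0, 240.0, 400.0):
            print(f"{f:>5.0f} Hz: ~{core_area_m2(V, f, N, B) * 1e4:.0f} cm^2 core")
        # -> ~120 cm^2 at 60Hz, ~30 cm^2 at 240Hz, ~18 cm^2 at 400Hz, which
        #    is why 400Hz is popular where transformer weight matters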

    In theory, assuming the AC comes in at 60Hz, could have a sort of rotary converter to boost the line frequency (could have a vaguely similar construction to an AC motor, but where input power uses 6 coils, and the output side has 12 or 24 coils; likely also operating like a boost transformer).

    Not sure if anyone already builds this, or what the conversion efficiency
    of such a device would be. It would need a high conversion efficiency
    (otherwise it would not offset the losses in the smaller downstream
    transformers).

    Though, wouldn't really gain anything if just going directly to DC via
    bridge rectifiers (with no intermediate transformers), and then using
    DC-DC conversion.


    So, say 1320VAC 3-phase could likely be rectified into 960VDC, where,
    assuming the presence of big capacitors, the voltage would drop slightly
    in conversion due to phase ripple (the "peaks" getting flattened out).

    Or, in theory, I have little idea where people would get diodes and
    capacitors big enough for this. Presumably giant industrial-sized diodes
    and capacitors could exist though (well, and/or PCBs with craptons of
    smaller components).

    Then again, in a relative sense, boards with 1000s of diodes and
    capacitors wouldn't cost much relative to the cost of the building and servers.

    ...




    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.



    Yes, pretty much.

    MOSFET, diode (from ground), inductor, and a capacitor;
    Then you need a controller circuit to keep track of the voltage and
    adjust the duty cycle as needed to maintain the target voltage.

    The MOSFET lets power in, which goes through the coil and charges the
    capacitor (in parallel with the load). When the MOSFET turns off, the
    inductor voltage reverses and its current keeps flowing, pulled up
    through the diode from the ground rail.


    It is possible to use an op-amp for this (rather than a microcontroller),
    but an op-amp would generate very crude PWM, and thus be noisier.

    Possible noise reduction approaches:
    Big capacitor;
    Secondary inductor, diode, and capacitor.
    Assuming a constant load, a second inductor could smooth the PWM noise
    by maintaining closer to a constant current; but is more likely to see
    voltage ripples if there are sudden changes in the load (if compared
    with using a bigger capacitor).

    By comparison, a microcontroller can generate a higher-frequency PWM
    signal, and keep the initial noise lower.
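    A minimal Python sketch of the first-order buck relations for the
    continuous-conduction case (duty cycle D = Vout/Vin, inductor ripple
    dI = (Vin-Vout)*D/(L*f), output ripple dV ~= dI/(8*f*C)); the component
    values are illustrative assumptions, not a design:

        def buck(vin, vout, f_sw, L, C):
            d  = vout / vin                       # ideal duty cycle
            di = (vin - vout) * d / (L * f_sw)    # inductor ripple current
            dv = di / (8.0 * f_sw * C)            # output voltage ripple
            return d, di, dv

        d, di, dv = buck(vin=48.0, vout=12.0, f_sw=500e3, L=10e-6, C=100e-6)
        print(f"duty {d:.0%}, ripple {di:.2f} A, output ripple {dv*1e3:.1f} mV")
        # -> duty 25%, ripple 1.80 A, output ripple 4.5 mV; raising f_sw
        #    shrinks both ripple terms, which is the point about running the
        #    PWM faster than a crude op-amp scheme can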


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 26 23:35:52 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------------snip----------------------------------
    Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    The military routinely uses 400 Hz to reduce the weight of transformers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 19:37:34 2025
    From Newsgroup: comp.arch

    On 9/26/2025 6:35 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------------snip----------------------------------
    Higher voltage would be needed with DC vs AC, as DC is more subject to
    resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    The military routinely uses 400 Hz to reduce the weight of transformers.

    OK, so it makes sense then...

    I guessed 240Hz as it could likely be enough to usefully boost
    efficiency, but not so high as to cause significant leakage from the building's electrical system.

    Something like 400 or 480Hz should also work.


    Moving too far into kHz territory is likely to result in significant
    leakage.

    Though, looking into it, would likely have to get pretty high into the
    kHz range before a building's power distribution system starts radiating
    most of the power into the environment (with most of the sub-kHz
    territory likely being pretty safe here).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 27 08:14:11 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread motors by far up to 25-30 years ago. But my imressioon was that today various type of electric motors (DC, esp. brushlees, AC sync, AC async) enjoy similar popularity.

    I can only speak from personal experience about the industry I
    work in (chemical). People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed. There are no DC networks in chemical plants.

    If you have a high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    I never was in big datacenter, but heard that they prefer DC.

    Eventually, electronics requires DC. Of course, you can make
    an economic calculation of where you put your transformers and
    rectifiers, and where you want which voltage.

    An option which makes little sense is to have a rectifier which
    creates high-voltage DC, then distributes that, and to have
    an alternator at the other end to create AC which you can then
    transform down. It would be better to distribute AC and transform
    it down, saving two parts.



    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.

    I'm more used to thyristors in that role.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 13:27:02 2025
    From Newsgroup: comp.arch

    On 26/09/2025 14:10, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    True DC motors - with brushes - are rarely found outside of very small
    motors (where they are cheap and simple). But there are a wide variety
    of AC motors controlled in many different ways. Asynchronous AC motors
    are only one type. There are lots of other topologies for motors and
    their controllers, with different pros and cons and suitable applications.

    What if, opposed to each computer using its own power-supply (from 120
    or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

    That makes little sense. If you're going to distribute power,
    distribute it as AC so you save one transformer.


    There are lots of advantages of distributing power as DC. Transformers
    are only a good choice at higher voltages - once you get to the levels
    that can be handled well by semiconductor switches, they are smaller and
    more efficient, and work best for DC-to-DC. 1200V switches are cheap
    and common now, though there are devices that handle a few thousand
    volts. Electric car charger standards are 400V and 800V, with some new
    ones at 1000V or up to 1500V.

    It makes sense to distribute locally at something like 48V or 60V DC.
    Connections are simpler, you can take it directly from a UPS, and the
    local conversion to low-voltage power lines is simpler than with 120V
    or 240V AC.

    So for a data centre, using perhaps 800V DC (taking advantage of the
    electric car industry standards) to the rack, then 48V DC to the devices
    on the rack would seem a good setup to me. DC also makes life much
    easier and more efficient when you have UPSs and battery backup -
    locally in a rack, or wider in the higher level supply.



    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,

    Transistors?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 13:52:23 2025
    From Newsgroup: comp.arch

    On 27/09/2025 10:14, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage. Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread motors by far up to 25-30 years ago. But my
    imressioon was that today various type of electric motors (DC, esp.
    brushlees, AC sync, AC async) enjoy similar popularity.

    I can only speak from poersonal experience about the industry I
    work in (chemical). People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed. There are no DC network in chemical plants.

    If you have high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    These are not "DC motors" in the traditional sense, like brushed DC
    motors. The motors you use in a car have (roughly) sine wave drive
    signals, generally 3 phases (but sometimes more). Even motors referred
    to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
    waveforms are more trapezoidal than sinusoidal.

    And whenever you have a frequency inverter, the input to the inverter
    is first rectified to DC, then new AC waveforms are generated using
    PWM-controlled semiconductor switches.

    Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
    brushed DC motors.

    Bigger brushed DC motors, as you say, used to be used in situations
    where you needed speed control and the alternative was AC motors driven
    at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies.
    And as you say, these were replaced by AC motors driven from frequency inverters. Asynchronous motors (or "induction motors") were popular at
    first, but are not common choices now for most use-cases because
    synchronous AC motors give better control and efficiencies. (There are,
    of course, many factors to consider - and sometimes asynchronous motors
    are still the best choice.)




    Or, 2-stage, say:
    960V -> 192V (with 960V to each rack).
    192V -> 12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,


    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.

    I'm more used to thyristors in that role.


    It's better, perhaps, to refer to "semiconductor switches" as a more
    general term.

    Thyristors are mostly outdated, and are only used now in very high power situations. Even then, they are not your granddad's thyristors, but
    have more control for switching off as well as switching on - perhaps
    even using light for the switching rather than electrical signals.
    (Those are particularly nice for megavolt DC lines.)

    You can happily switch multiple MW of power with a single IGBT module
    for a couple of thousand dollars. Or you can use SiC FETs for up to a
    few hundred kW but with much faster PWM frequencies and thus better
    control.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 27 12:38:14 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:

    And whenever you have a frequency inverter, the input to the frequency
    is first rectified to DC, then new AC waveforms are generated using PWM controlled semiconductor switches.

    If you have three phases (required for high-power industrial motors)
    I believe people use the three phases directly to convert from three
    phases to three phases.

    The resulting waveforms are not pretty, and contribute to the
    difficulty of measuring power input.

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 15:15:41 2025
    From Newsgroup: comp.arch

    On 27/09/2025 14:38, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    And whenever you have a frequency inverter, the input to the frequency
    is first rectified to DC, then new AC waveforms are generated using PWM
    controlled semiconductor switches.

    If you have three phases (required for high-power industrial motors)
    I believe people use the three phases directly to convert from three
    phases to three phases.

    The resulting waveforms are not pretty, and contribute to the
    difficulty of measuing power input.


    That used to be how it was done - using thyristors, and powering
    induction motors. But it is not how it has been done in new motors for
    a long time. (In industrial use, some motors can be very big, very
    expensive, and very difficult to replace - thus factories can have the
    same motors for decades, even though better and more efficient ones are available.)

    Using thyristors to regulate the power out from your three phase input
    is relatively simple, but as you say, the waveforms are not pretty.
    This leads to significant noise (electrical and audible), vibrations,
    torque ripple, and wear and tear on the motor. And it makes a mess of
    the input supply, giving harmonics and phase differences between the
    current and voltage input - which leads to significant losses in the
    power delivery. These losses are between the generation and the
    customer, meaning the electricity supplier sees it but the customer does
    not see it on their bill - thus electricity suppliers greatly dislike
    it. The effect is less with thyristors on three phase power than
    thyristors on two phase power, but it is still very bad.

    So these days, the AC power - two phase or three phase - is invariably converted to DC first, using power factor correction rectification (so
    that the instantaneous current draw is proportional to the voltage at
    the time, keeping current and voltage in phase and nicely sinusoidal).
    After that, the AC drive to the motor is generated using PWM signals -
    from perhaps 2 or 4 kHz for old IGBT systems to at least 20 kHz for
    newer systems (avoiding audible noise) or up to maybe 160 kHz using GaN
    or SiC FETs - higher frequencies mean smaller and lighter inductors and capacitors.

    These kinds of motor control are smaller, more efficient, and much more controllable than old thyristor-based drives.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Sep 27 14:56:06 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 9/26/2025 9:28 AM, Scott Lurndal wrote:


    In those datacenters, the UPS distributes 48VDC to the rack components
    (computers, network switches, storage devices, etc).



    48VDC also makes sense, as it is common in other contexts. I sorta
    figured a higher voltage would have been used to reduce the wire
    thickness needed.

    This is within a 19" rack.



    I did realize after posting that, if the main power rails were organized
    as a grid, the whole building could be done probably with 1.25" aluminum bars.

    The Burroughs V5x0 series ECL machines had Aluminum bus-bars.

    Spectacular failure mode when/if something conductive (screwdriver,
    wrench) was dropped across the hot and ground bars.


    Could power the grid of bars at each of the 4 corners, with maybe some
    central diagonal bars (which cross and intersect with the central part
    of the grid, and an additional square around the perimeter). Each corner
    supply could drive 512A, and with this layout, no bar or segment should
    exceed 128A.

    In the old mainframe days, there would be large bus-bars (in an enclosure) across the ceiling and plug-in tap boxes would drop power to the
    various mainframe units.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Sep 27 14:56:47 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:
    --------------------snip----------------------------------
    Higher voltage would be needed with DC vs AC, as DC is more subject to
    resistive losses. Though, more efficiency on the AC side would be
    possible by increasing line frequency, say, using 240Hz rather than
    60Hz; but don't want to push the frequency too high as then the wires
    would start working like antennas and radiating the power into space.

    The military routinely uses 400 Hz to reduce the weight of transformers.

    IBM mainframes used 400 Hz (via a motor-generator set).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Sat Sep 27 08:57:44 2025
    From Newsgroup: comp.arch

    On 9/26/25 5:37 PM, BGB wrote:

    Something like 400 or 480Hz should also work.

    Would y'all please change the subject line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Sep 27 14:23:22 2025
    From Newsgroup: comp.arch

    On 9/27/2025 6:52 AM, David Brown wrote:
    On 27/09/2025 10:14, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial
    applications IIRC.

    I've never encountered that voltage.  Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were that most wide-spread  motors by far up to 25-30 years ago. But my imressioon was that today various type of electric motors (DC, esp.
    brushlees, AC sync, AC async) enjoy similar popularity.

    I can only speak from poersonal experience about the industry I
    work in (chemical).  People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed.  There are no DC network in chemical plants.

    If you have high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    These are not "DC motors" in the traditional sense, like brushed DC motors.  The motors you use in a car have (roughly) sine wave drive signals, generally 3 phases (but sometimes more).  Even motors referred
    to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
    waveforms are more trapezoidal than sinusoidal.


    Yes.

    Typically one needs to generate a 3-phase waveform at the speed they
    want to spin the motor at.



    I had noted from some experience writing code to spin motors (typically
    on an MSP430, mostly experimentally) or similar:
    Sine waves give low noise, but less power;
    Square waves are noisier and only work well at low RPM,
    but have higher torque.
    Sawtooth waves seem to work well at higher RPMs.
    Well, sorta, more like sawtooth with alternating sign.
    Square-root sine: intermediate between sine and square.
    Gives torque more like a square wave, but quieter.
    Trapezoid waves are similar to this, but more noise.

    Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
    the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
    and thus generate a lot of heat otherwise).

    The sawtooth wave helps here because the coil current can't change
    quickly, so one hits the coils with full power at the start (to get the
    current going), then rapidly drops back down to zero, then does the same
    on the opposite sign for the next part of the wave.
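
    As a rough sketch of the blending idea (purely illustrative C, not the
    actual MSP430 code; the blend/amplitude curves and table size are
    made-up assumptions):

      #include <math.h>
      #include <stdint.h>

      #ifndef M_PI
      #define M_PI 3.14159265358979323846
      #endif

      #define TABLE_LEN 256  /* PWM duty samples per electrical cycle (arbitrary) */

      /* Square-root sine: keeps the sign of sin(x), takes sqrt of the magnitude. */
      static double sqrt_sine(double x)
      {
          double s = sin(x);
          return (s >= 0.0) ? sqrt(s) : -sqrt(-s);
      }

      /* Alternating-sign sawtooth: full drive at the start of each half cycle,
         decaying to zero, then the same shape with the opposite sign. */
      static double alt_sawtooth(double x)
      {
          double t = fmod(x, 2.0 * M_PI) / (2.0 * M_PI);   /* 0..1 over one cycle */
          double u = (t < 0.5) ? (t * 2.0) : ((t - 0.5) * 2.0);
          return (t < 0.5) ? (1.0 - u) : -(1.0 - u);
      }

      /* Build an 8-bit duty table: morph from sqrt-sine toward sawtooth as the
         target RPM rises, and reduce amplitude at low RPM to limit heating. */
      void build_wave_table(uint8_t table[TABLE_LEN], double target_rpm, double max_rpm)
      {
          double frac      = target_rpm / max_rpm;    /* 0..1 */
          double blend     = frac;                    /* guessed morphing curve */
          double amplitude = 0.3 + 0.7 * frac;        /* guessed derating curve */

          for (int i = 0; i < TABLE_LEN; i++) {
              double x = 2.0 * M_PI * i / TABLE_LEN;
              double v = (1.0 - blend) * sqrt_sine(x) + blend * alt_sawtooth(x);
              table[i] = (uint8_t)((v * amplitude * 0.5 + 0.5) * 255.0);
          }
      }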


    When I was messing around with it at the time, input control signals
    were typically one of:
      ADC input connected to a POT (for direct control);
      1-2ms RC style PWM (a decoding sketch follows below).

    Step/Dir signaling (typical for stepper drivers and servomotor
    controllers) could also make sense.
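
    For the 1-2ms RC-style PWM input mentioned above, a minimal decoding
    sketch (generic hobby-RC convention, hypothetical function name; the
    pulse width itself would come from a timer-capture peripheral):

      #include <stdint.h>

      /* Map an RC servo pulse (nominally 1000..2000 us, 1500 us = neutral) to a
         signed command in -1000..+1000; pulses far outside the window are
         treated as "no signal".  Thresholds follow the usual hobby convention. */
      int16_t rc_pulse_to_command(uint16_t pulse_us, int *valid)
      {
          if (pulse_us < 800 || pulse_us > 2200) {
              *valid = 0;                 /* lost or implausible signal */
              return 0;
          }
          *valid = 1;
          int32_t cmd = ((int32_t)pulse_us - 1500) * 2;   /* +/-500 us -> +/-1000 */
          if (cmd >  1000) cmd =  1000;
          if (cmd < -1000) cmd = -1000;
          return (int16_t)cmd;
      }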

    One other option is dual-phase motors, which have the partial advantage
    that one can use a repurposed stepper driver (typically set to
    micro-stepping). A lot of the dual-phase motors in this case, though,
    were built from repurposed capacitor-run split-phase motors.

    Say, for example, one can be like, "Yeah, this AC split phase motor is
    close enough to being a NEMA34 stepper...".

    One typically needs to partly rewire it, as split-phase motors usually
    have 3 wires but 4 are needed in this case. Some other motors are easily
    modified into 3-phase though (having the same coils as a 3-phase motor
    internally, just wired into a split-phase configuration with a 60-degree
    phase offset, vs 90 degrees in some other motors).


    One can get different properties by machining a new rotor, as these
    motors invariably come with squirrel-cage rotors. The easiest/cheapest
    type to machine here is a reluctance rotor.

    Main annoyance mostly being that this can be a pretty big chunk of steel
    for any non-trivial motor (also heavy). Could likely reduce weight by
    making the base rotor by layering multiple sizes of steel tubing, then
    brazing or welding it all together with some steel end-caps (drilled out
    for the motor shaft, probably also brazed in place). Then turn it to the target diameter, and mill the side grooves.

    Well, or find something with sufficiently thick walls (say, a chunk of
    5" OD, schedule 120 or 180 steel pipe). This would simplify the process,
    and be cheaper (and lighter) than, say, a chunk of 5" bar stock.


    Haven't done much in this area for a while, was mostly messing around a
    lot more with this when I was a little younger.


    And whenever you have a frequency inverter, its input is first rectified
    to DC, then new AC waveforms are generated using PWM-controlled
    semiconductor switches.


    Yes:
      Dual-phase: may use a "Dual H-Bridge" configuration
        Where the H-bridges are built using power transistors;
      Three-phase: "Triple Half-Bridge"
        Needs fewer transistors than dual-phase (6 rather than 8).

    It is slightly easier to build these drivers with BJTs or Darlington
    transistors; these tend to handle less power and generate more heat, but
    are more fault tolerant.


    MOSFETs can handle more power, but one needs to be very careful not to
    exceed the Gate-Source voltage limit, otherwise they are insta-dead (and
    will behave as if they are shorted).

    So, one needs a more complex circuit, say:
    MOSFET power transistor (typically NMOS);
    NPN or PNP control transistor (such as a 2N3904 or similar);
    Pull up/down resistors;
    Zener diode.
    In which case the control transistors can be driven as in a typical
    H-Bridge.

    Say, for example:
    Pull down resistor pulls Gate to Source, keeping it off by default;
    Zener diode in parallel with resistor, to impose VGS limit;
    Pull-up transistor connects to Drain via a resistor
    (via emitter or collector, depending on PNP or NPN).
    Base on control transistor used for control.

    Then one can control the MOSFETs as if they were BJTs. Not sure why they
    can't have this stuff built in (sort of like with a Darlington), but alas.

    One typically also needs flyback diodes, and a main DC rail capacitor,
    and a DC rail zener diode, ...


    Though, at this stage, it is preferable to buy these things rather than
    build them, as the hand-built ones tend to have a bad habit of exploding.



    Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
    brushed DC motors.


    Pretty much.

    Brushed DC is more the motor technology one finds in toys and a lot of
    cordless power tools. Also the "Power Wheels" vehicles, which tended to
    use the same kind of 1/4 HP brushed DC motors often found in cordless
    power tools.

    Some adults have ridden around on these things, sometimes modding them
    to use bigger 1/2 or 3/4 HP motors. That typically also needs a bigger
    battery; the stock ones used repurposed UPS batteries (going from one I
    ended up tearing down some years ago). Otherwise, they are mostly all
    plastic apart from a steel axle and similar.


    Bigger brushed DC motors, as you say, used to be used in situations
    where you needed speed control and the alternative was AC motors driven
    at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies. And
    as you say, these were replaced by AC motors driven from frequency inverters.  Asynchronous motors (or "induction motors") were popular at first, but are not common choices now for most use-cases because
    synchronous AC motors give better control and efficiencies.  (There are,
    of course, many factors to consider - and sometimes asynchronous motors
    are still the best choice.)


    Yeah.

    Large brushed DC motors are not usually seen much IME.


    Have encountered brushed DC motors up to around 1/2 or 3/4 HP, not sure
    if they go much larger.

    The larger ones tend to run at lower RPMs, say (IIRC):
    1/4 HP: 20000 RPM (roughly 1.25" OD x 2" L)
    1/2 HP: 10000 RPM (roughly 2.5" OD x 4" L)
    3/4 HP: 6000 RPM (roughly 4" OD x 6" L)

    They are also typically physically larger (though a 3/4 HP brushed DC
    motor is merely the size of a 1/4 HP AC induction motor): very large by
    DC motor standards, but by AC motor standards smaller than the motors
    typically used to spin the fan blades on an air conditioner unit.

    Whereas, a 3/4 HP induction motor is a much bigger beast.


    BLDC motors are typically also small, and pure BLDC motors are often
    very expensive much over 1/4 HP (often because they use neodymium
    magnets).

    The other option is reluctance motors, though these may or may not be
    passed off as BLDC.

    One can sort of tell the difference by spinning them with no power
    applied: true BLDCs will have a high "cogging torque" (almost more like
    a stepper motor, but not as strong and with much bigger steps);
    if there is a very weak cogging torque, it is likely one of the
    intermediate reluctance/BLDC hybrids (e.g., with a ceramic ring magnet);
    if it spins freely (no cogging torque), it is likely a reluctance motor.





    Or, 2-stage, say:
        960V -> 192V (with 960V to each rack).
        192V ->  12V (with 192V to each server).

    Where the second stage drop could use slightly cheaper transistors,

    Transistors?

    Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
    transistors.

    I'm more used to thyristors in that role.


    It's better, perhaps, to refer to "semiconductor switches" as a more
    general term.

    Thyristors are mostly outdated, and are only used now in very high power situations.  Even then, they are not your granddad's thyristors, but
    have more control for switching off as well as switching on - perhaps
    even using light for the switching rather than electrical signals.
    (Those are particularly nice for megavolt DC lines.)

    You can happily switch multiple MW of power with a single IGBT module
    for a couple of thousand dollars.  Or you can use SiC FETs for up to a
    few hundred kW but with much faster PWM frequencies and thus better
    control.


    Yes.

    For medium power, typically MOSFETs were used.

    For low power, typically BJTs or Darlingtons.

    But, BJTs seem to become impractical much over around 60V 5A or so. Even
    this requires a pretty aggressive heat-sink and/or active cooling.


    MOSFETs handle more power with less heat, and are often available up to
    around 1000V 50A or so (in TO-247 packaging or similar), but can be run
    in parallel as needed for more amps.
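
    As a rough illustration of why paralleling helps (made-up numbers,
    conduction loss only; switching losses and derating ignored): loss
    scales as I^2 * Rds(on), so splitting the current across N devices cuts
    the total dissipation by N.

      #include <stdio.h>

      /* Total conduction loss for N identical MOSFETs sharing the load current. */
      static double conduction_loss_w(double total_a, double rds_on_ohm, int n)
      {
          double i_each = total_a / n;
          return n * i_each * i_each * rds_on_ohm;
      }

      int main(void)
      {
          printf("1 FET : %.1f W\n", conduction_loss_w(50.0, 0.020, 1));  /* 50.0 W */
          printf("2 FETs: %.1f W\n", conduction_loss_w(50.0, 0.020, 2));  /* 25.0 W */
          printf("4 FETs: %.1f W\n", conduction_loss_w(50.0, 0.020, 4));  /* 12.5 W */
          return 0;
      }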

    IGBTs for when one needs something big...


    Never really messed with Thyristors.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Sep 28 12:00:56 2025
    From Newsgroup: comp.arch

    On Fri, 26 Sep 2025 14:28:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    BGB <cr88192@gmail.com> writes:
    On 9/25/2025 9:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:



    Consider that there are losses converting from the
    primary (e.g. 22kv) to 480v (2%), and additional losses
    converting to 208v (3%) to the UPS. That's before any
    rectification losses (6% to 12%). With various optimizations,
    they reduced total losses to 7.5%, including rectification
    and transformation from the primary voltage.


    Hmm...

    Brings up a thought: 960VDC is a semi-common voltage in industrial >applications IIRC.

    What if, opposed to each computer using its own power-supply (from
    120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.


    In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

    I looked at PSUs offered by Dell for their rack servers. There are four
    options for the inputs, although not every server model has all four.
    The options are:
    - 100-240 VAC.
    - 200-240 VAC
    - -48 VDC
    - 240-400 VDC

    I don't know in which countries and in which branch of IT they prefer
    the fourth option, but knowing Dell of late (as opposed to Dell of up to ~2008), they would not offer the option unless demand was quite
    significant.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Sep 28 16:44:02 2025
    From Newsgroup: comp.arch

    On 27/09/2025 21:23, BGB wrote:
    On 9/27/2025 6:52 AM, David Brown wrote:
    On 27/09/2025 10:14, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    Brings up a thought: 960VDC is a semi-common voltage in industrial >>>>>> applications IIRC.

    I've never encountered that voltage.  Direct current motors are
    also mostly being phased out (pun intended) by asynchronous motors
    with frequency inverters.


    Are you sure?
    Indeed, in industry, outside of transportation, asynchronous AC motors
    were the most widespread motors by far up to 25-30 years ago. But my
    impression was that today various types of electric motors (DC, esp.
    brushless, AC sync, AC async) enjoy similar popularity.

    I can only speak from personal experience about the industry I
    work in (chemical).  People used to use DC motors when they needed
    variable motor speed, but have now switched to asynchronous (AC)
    motors with frequency inverters, which usually have a 1:10 ratio
    of speed.  There are no DC networks in chemical plants.

    If you have high-voltage DC system (like in an electric car) then
    using DC motors makes more sense.


    These are not "DC motors" in the traditional sense, like brushed DC
    motors.  The motors you use in a car have (roughly) sine wave drive
    signals, generally 3 phases (but sometimes more).  Even motors
    referred to as "Brushless DC motors" - "BLDC" - use AC inputs, though
    the waveforms are more trapezoidal than sinusoidal.


    Yes.

    Typically one needs to generate a 3-phase waveform at the speed they
    want to spin the motor at.


    Details of motor drives are perhaps getting a bit OT for this group - but
    there are people here interested in all sorts of things. If you want to
    have more discussions on motor drives, comp.arch.embedded might be a
    nice place for a new thread - the group appears fairly empty, but
    experts crawl out of the woodwork whenever an interesting new thread is started!



    I had noted in some experience when writing some code to spin motors (typically on an MSP430, mostly experimentally) or similar:

    Experiments are always good, but it is also helpful to combine them with
    a bit of theory so that you don't generalise too much from a small
    number of tests. In particular, the motor windings in a three phase AC
    motor can be done in several different ways, optimised for different
    kinds of controlling waves. The two main ones for small and medium
    permanent magnet motors are for sinusoidal waves (aiming for smoothest
    and most controlled driving - often called "PMSM - permanent magnet synchronous motors") and for trapezoidal driving (for simpler driving,
    often referred to as "BLDC - Brushless DC").

    Then there are different ways to track the position of the motor. You
    can have hall effect sensors, which are simple and cheap, giving 6
    positions per electrical rotation (motors can have multiple sets of
    windings and magnets, giving two or more electrical rotations per
    mechanical rotation). These are good for trapezoidal BLDC control. It
    is also possible to use sensorless control, where the hall effect
    signals are calculated by measuring the back EMF from the motor windings during the off periods of the driver half bridges. This avoids the
    sensors and can make cabling easier, but can't be used at low speed - it
    is only suitable for continuously running motors rather than positioning motors.

    Or you can have encoders, which give the more precise position needed
    for sine wave or PMSM waves. These are usually quadrature encoders,
    which are accurate and reliable but need to pass through an index
    position to get their absolute position. Sometimes absolute encoders
    are used - these are either cheaper but less precise using analogue hall effect sensors, or much more expensive using multiple Gray code rings
    with optical or inductive sensing.
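
    A minimal sketch of incremental quadrature decoding (the textbook
    lookup-table approach; the sign convention depends on wiring, and
    index-pulse handling for absolute position is left out):

      #include <stdint.h>

      /* The two encoder channels (A, B) form a 2-bit Gray-code sequence;
         comparing the previous and current state gives the direction of each
         transition.  Invalid double transitions contribute 0. */
      static const int8_t quad_delta[16] = {
          /* index = (prev_state << 2) | new_state, state = (A << 1) | B */
           0, +1, -1,  0,
          -1,  0,  0, +1,
          +1,  0,  0, -1,
           0, -1, +1,  0,
      };

      static uint8_t prev_state;
      static int32_t position;

      void quadrature_update(uint8_t a, uint8_t b)
      {
          uint8_t state = (uint8_t)((a << 1) | b);
          position += quad_delta[(prev_state << 2) | state];
          prev_state = state;
      }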

    For trapezoidal drives, you usually have a simple 6-step switching
    sequence, with each of the three half-bridges driving high for 2 steps,
    off for 1 step, low for 2 steps, off for one step. You can control the
    speed of the motor by the speed of the steps, and the power by using PWM modulation when driving high or low (or by using a single PWM control
    for the common DC bus voltage).
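
    A sketch of that 6-step sequence in table form (the hall-to-step mapping
    is a hypothetical example - the real mapping depends on the particular
    motor and sensor placement):

      #include <stdint.h>

      typedef enum { PH_OFF, PH_HIGH, PH_LOW } phase_t;
      typedef struct { phase_t a, b, c; } commutation_t;

      /* Each half-bridge: high for 2 steps, off for 1, low for 2, off for 1. */
      static const commutation_t steps[6] = {
          { PH_HIGH, PH_LOW,  PH_OFF  },
          { PH_HIGH, PH_OFF,  PH_LOW  },
          { PH_OFF,  PH_HIGH, PH_LOW  },
          { PH_LOW,  PH_HIGH, PH_OFF  },
          { PH_LOW,  PH_OFF,  PH_HIGH },
          { PH_OFF,  PH_LOW,  PH_HIGH },
      };

      /* 3-bit hall state (1..6 valid; 0 and 7 indicate a sensor fault). */
      static const int8_t hall_to_step[8] = { -1, 0, 2, 1, 4, 5, 3, -1 };

      commutation_t commutate(uint8_t hall_state)
      {
          int8_t step = hall_to_step[hall_state & 7];
          if (step < 0) {
              commutation_t all_off = { PH_OFF, PH_OFF, PH_OFF };  /* disable on fault */
              return all_off;
          }
          return steps[step];
      }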

    For sine wave driving, you need fast PWM for each of the three half
    bridges to generate three sine waves at 120° phase differences. The PWM frequency has to be high enough so that after the filtering in the motor windings, you have little in the way of harmonics.

    Generally, however, instead of actively producing sine waves, you do
    what is known as "vector control" - you measure the currents in the
    three branches, and use the angle data to convert these to currents perpendicular to and aligned to the motor position. You then regulate
    the PWM values to control these two currents - aiming to get the desired current in the active direction, and zero current perpendicular to it
    (since that is just wasted effort). The resulting waveforms are
    somewhat distorted sine waves.
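
    The core of that measurement step, as a sketch (standard Clarke/Park
    transforms; the surrounding PI regulators, and the naming, are
    assumptions rather than anything from this thread):

      #include <math.h>

      typedef struct { float d, q; } dq_t;

      /* Project measured phase currents onto the rotor frame at electrical
         angle theta.  Conventionally i_d (flux-aligned) is regulated to zero
         and i_q (perpendicular, torque-producing) to the torque demand. */
      dq_t clarke_park(float ia, float ib, float theta)
      {
          /* Clarke (amplitude-invariant), assuming ia + ib + ic == 0. */
          float i_alpha = ia;
          float i_beta  = (ia + 2.0f * ib) / sqrtf(3.0f);

          /* Park: rotate into the rotor reference frame. */
          float s = sinf(theta), c = cosf(theta);
          dq_t out = {  i_alpha * c + i_beta * s,
                       -i_alpha * s + i_beta * c };
          return out;
      }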


    An MSP430 is fine for trapezoidal control and hall effect sensors, but a
    bit underpowered for serious sine wave or vector control. You are better
    off with a Cortex-M4 for motors.


      Sine waves give low noise, but less power;

    Sine waves are closer to the ideal for many motors, but you'll get even
    lower noise with good vector control.

    You can also try adding some third harmonic - use sin(x) + 1/9 sin(3x).
    The third harmonic disappears in the motor, since it affects all three
    phases equally. But it flattens out the peaks of the sine wave and lets
    you then increase the amplitude by about 12.5% before hitting 100% of
    your DC bus voltage.
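
    A small sketch of that trick as three PWM duty cycles (the 9/8 scaling
    follows from the 8/9 peak of sin(x) + sin(3x)/9; the rest is
    illustrative):

      #include <math.h>

      #ifndef M_PI
      #define M_PI 3.14159265358979323846
      #endif

      /* Three duty cycles (0..1) at 120 degree offsets with third-harmonic
         injection; the injected sin(3x)/9 term is identical in all three
         phases, so it cancels in the line-to-line voltages. */
      void three_phase_duties(double theta, double amplitude, double duty[3])
      {
          for (int ph = 0; ph < 3; ph++) {
              double x = theta - ph * (2.0 * M_PI / 3.0);
              double v = sin(x) + sin(3.0 * x) / 9.0;
              v *= (9.0 / 8.0) * amplitude;        /* peak hits +/-1 at amplitude = 1 */
              duty[ph] = 0.5 + 0.5 * v;            /* centred on 50% duty */
          }
      }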


      Square waves are noisier and only work well at low RPM,
        but have higher torque.

    Square waves are a really bad idea - you jump between high torque and
    low torque, and will regularly be pulling the motor back a bit rather
    than forwards. Prefer trapezoidal control - it is just as easy, and
    works vastly better. You of course get more torque ripple than with
    sine waves or vector control.

      Sawtooth waves seem to work well at higher RPMs.
        Well, sorta, more like sawtooth with alternating sign.

    Do you mean trapezoidal control?

      Square-Root Sine: Intermediate between sine and square.
        Gives torque more like a square wave, but quieter.

    That's just weird. I think what you are seeing is something similar to
    the shape you get from vector control.

        Trapezoid waves are similar to this, but more noise.

    Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
    the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
    and thus generate a lot of heat otherwise).


    Of course you have lower average voltage at lower speeds and torques -
    that's why you use PWM to control your wave amplitudes.



    And whenever you have a frequency inverter, its input is first rectified
    to DC, then new AC waveforms are generated using PWM-controlled
    semiconductor switches.


    Yes:
      Dual-phase: may use a "Dual H-Bridge" configuration
        Where, the H-bridge is built using power transistors;
      Three-phase: "Triple Half-Bridge"
        Needs fewer transistors than dual phase.

    It is slightly easier to build these drivers with BJTs or Darlington
    transistors; these tend to handle less power and generate more heat, but
    are more fault tolerant.


    MOSFETs can handle more power, but one needs to be very careful not to exceed the Gate-Source voltage limit, otherwise they are insta-dead (and will behave as if they are shorted).


    FETs are always used (until voltage or current requirements force you to
    use IGBTs) - no one uses BJTs or Darlingtons in real motor control. You
    need an appropriate gate driver for the FETs, but those are common and
    cheap, and usually combined with deadtime control to avoid accidental shoot-through when you enable the high side and low side together.
    Modules that combine all this with three half-bridges are also small and cheap.


    --- Synchronet 3.21a-Linux NewsLink 1.2