• Re: Efficiency of in-order vs. OoO

    From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 18:23:50 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:

    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

My 66000 Architecture defines 8 performance counters at each layer of
the design: each core gets 8 counters, each L1 gets 8, each L3 gets 8,
the Interconnect gets 8, each Memory Controller gets 8, and each PCIe
root gets 8--and every instance multiplies the counters.

    It's not really the number of counters that is important, rather
    it is what the counters count (i.e. which events can be counted).
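[As an illustrative aside: the statistical profiling mentioned above boils down to periodic sampling: snapshot where execution is at a fixed period, and let the histogram of samples approximate where the time goes. A toy sketch, with all function names and cycle counts invented:]

```python
# Toy illustration of statistical profiling: the "program" is a list of
# (function, cycles) intervals run back-to-back, and we sample it at a
# fixed period. Real SPE/PMU sampling works analogously, in hardware.

from collections import Counter

def sample_profile(intervals, period):
    """intervals: list of (name, cycles) executed back-to-back.
    Returns a Counter of samples per name, taken every `period` cycles."""
    samples = Counter()
    t = 0        # next sample time, in cycles
    clock = 0    # current time, in cycles
    for name, cycles in intervals:
        end = clock + cycles
        while t < end:
            samples[name] += 1
            t += period
        clock = end
    return samples

if __name__ == "__main__":
    run = [("init", 100), ("hot_loop", 10_000), ("cleanup", 400)]
    print(sample_profile(run, period=250))   # hot_loop dominates the samples
```

[The point being that the sample counts approximate the time distribution without perturbing the program, which is why the overhead stays near zero.]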

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Mar 25 18:35:35 2024

    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 25 14:33:44 2024

    On 3/25/2024 1:35 PM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.


    Odd...

I had mostly skipped performance counters in hardware, and was
instead using an emulator to model performance (among other things), but
for performance tuning this only works insofar as the emulator remains
accurate in terms of cycle costs (I make an effort, but accuracy seems to vary).


    One annoyance is that trying to model some newer or more advanced
    features may bog down the emulator enough that it can't maintain
    real-time performance.

Though, I guess it is likely that for a "not painfully slow" processor
(like an A55 or similar) cycle-accurate emulation in real-time at the
native clock speed may not be viable (one would burn way too many cycles
trying to model things like the cache hierarchy and branch predictor, ...).
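[To illustrate why that modeling is expensive: even a minimal cache model has to be consulted on every memory access the emulator executes. A toy direct-mapped sketch, with sizes and cycle costs invented for illustration:]

```python
# Minimal direct-mapped cache model of the kind an emulator might consult
# on every memory access; all parameters here are made-up examples.

class DirectMappedCache:
    def __init__(self, num_lines=256, line_size=64, hit_cost=1, miss_cost=40):
        self.num_lines = num_lines
        self.line_size = line_size
        self.hit_cost = hit_cost      # cycles charged on a hit
        self.miss_cost = miss_cost    # cycles charged on a miss (next-level fill)
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return the cycle cost of one access, updating hit/miss stats."""
        line_addr = addr // self.line_size
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return self.hit_cost
        self.tags[index] = tag        # allocate on miss
        self.misses += 1
        return self.miss_cost

if __name__ == "__main__":
    cache = DirectMappedCache()
    # Sequential sweep: one miss per 64-byte line, then hits within it.
    cycles = sum(cache.access(a) for a in range(0, 4096, 8))
    print(cache.hits, cache.misses, cycles)
```

[Even this trivial model adds an indexed lookup and a compare per emulated access; a multi-level hierarchy plus a branch predictor multiplies that overhead.]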


Some amount of debugging and performance measurement is possible via
"LED" outputs, which show the status of pipeline stalls and the signals
that feed into these stalls (and, indirectly, the percentage of time
spent running instructions, via the absence of stalls, ...), ...

I had generated a cycle-use ranking for the full simulation by having the
testbench code run checks mostly on these LED outputs (rather than looking
at the internal signals directly).
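[The kind of tally such a testbench can build from sampled stall signals might look like the following sketch; the stall-cause names here are hypothetical:]

```python
# Hypothetical sketch: tally per-cycle stall signals (as a testbench
# watching "LED" outputs might) and report what fraction of cycles did
# useful work, plus how often each stall cause was asserted.

def stall_breakdown(samples):
    """samples: iterable of per-cycle sets naming the stall causes
    asserted that cycle (empty set = no stall, instruction progressed)."""
    counts = {}
    busy = 0
    total = 0
    for causes in samples:
        total += 1
        if not causes:
            busy += 1
        for c in causes:
            counts[c] = counts.get(c, 0) + 1
    return busy / total, counts

if __name__ == "__main__":
    trace = [set(), {"dcache"}, {"dcache"}, set(),
             {"branch"}, set(), set(), {"dcache", "branch"}]
    util, causes = stall_breakdown(trace)
    print(f"useful cycles: {util:.0%}", causes)
```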

    Runs on an actual FPGA are admittedly comparably infrequent.


Though, ironically, I have noted that things like shell commands, etc., can
still be fairly responsive even for Verilog simulations effectively
running in kHz territory (whereas good responsiveness is sometimes a
struggle even for modern PCs running Windows).

Or, take a tool I had recently been working on: due to some combination
of factors, at one point in the testing, process creation kept taking
around 20 seconds each time, which was rather annoying (because
seemingly Windows would just upload the whole binary to the internet,
then wait for a response, before letting it run).

Seemingly, something about the tool was triggering "Windows Defender
SmartScreen" or similar; it never gave any warnings/messages
about it, merely caused a fairly long/annoying delay whenever
the tool was relaunched. Then it just magically went away (after one of my
secondary UPSs had "let the smoke out" and the ISP had also partly gone
down for a while; I could see the ISP's local IPs, but access to the wider
internet was seemingly disrupted, ... Like, seemingly, a "the ghosts in
the machine are not happy right now" type of event).

The tool itself was mostly meant to be something sorta like SFTP, but
instead for working with disk images. I am starting to want to revisit the
filesystem question, but looking back at NTFS, I still don't really want
to try to implement an NTFS driver.

Possibly EXT2/3/4 would be an option, apart from the annoyance that
Windows can't access it, so I would still be little better off than just
rolling my own (while trying to keep the core design hopefully not
needlessly complicated).

    ...



  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Mar 25 20:22:00 2024

    In article <2024Mar25.193535@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

The question is whether "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

    John
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Mar 25 21:42:18 2024

    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

Having reverse engineered the original Pentium EMON counters, I got a
meeting with Intel about their next CPU (the Pentium Pro). What I was
told about the Pentium was that it was the first chip too complicated
to create/sell an In-Circuit Emulator (ICE) version of, so instead they
added a bunch of counters for near-zero-overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:46:39 2024

    jgd@cix.co.uk (John Dallman) writes:
In article <2024Mar25.193535@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

The question is whether "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

Look at VTune, for example.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:48:08 2024

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late '70s/early '80s) is JTAG.
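[For illustration: the bit-serial readout works like one big shift register. The chip's flip-flops are stitched into a chain, and state is clocked out one bit per clock while new state is clocked in. A toy model of just the shifting, not any real TAP state machine:]

```python
# Toy model of a scan chain: internal flip-flop state is stitched into one
# shift register and clocked out (while new state clocks in) bit-serially,
# the way JTAG dumps internal state for debug. Purely illustrative.

def scan_out(chain, scan_in_bits):
    """Shift the chain out one bit per clock while shifting new bits in.
    Returns (captured_bits, new_chain_state)."""
    chain = list(chain)
    out = []
    for bit in scan_in_bits:
        out.append(chain[-1])          # last flop in the chain drives TDO
        chain = [bit] + chain[:-1]     # everything shifts one stage toward TDO
    return out, chain

if __name__ == "__main__":
    state = [1, 0, 1, 1]               # captured internal state
    dumped, new_state = scan_out(state, [0, 0, 0, 0])
    print(dumped, new_state)
```

[A real chain has thousands to millions of stages, which is why the readout is slow but costs almost no extra silicon.]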
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:22:31 2024

    jgd@cix.co.uk (John Dallman) writes:
The question is whether "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

    That might explain why for the AmLogic S922X in the Odroid N2/N2+
    there is a Linux 4.9 kernel that supports performance monitoring
    counters (AmLogic put that in for their own uses), but the mainline
    Linux kernel does not support perf on the S922X (perf was not in the requirements of whoever integrated the S922X stuff into the mainline).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:27:54 2024

    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Look at vtune, for example.

    And?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Mar 26 10:47:07 2024

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late '70s/early '80s) is JTAG.


    Thanks!

JTAG was indeed the term I was looking for (and not remembering). Maybe
I'm getting old?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Mar 26 14:15:41 2024

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally, hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths, and the efficacy of cache coherency algorithms and
cache prefetchers.

    Their target is not application analysis.
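[One classic way such latencies are probed, in emulators and on silicon alike, is a dependent pointer chase: each access depends on the previous one, so accesses cannot overlap. A sketch of the structure (in Python the measured time reflects interpreter overhead, not memory latency, so treat this as illustrating the technique only):]

```python
# Sketch of the dependent pointer chase used to probe load-to-use latency:
# a random cyclic permutation is walked so every access depends on the
# result of the one before it, defeating overlap and prefetching.

import random
import time

def make_chain(n):
    """Build a random cyclic permutation: chain[i] gives the next index."""
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        chain[a] = b
    return chain

def chase(chain, steps):
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = chain[i]           # each access depends on the previous result
    dt = time.perf_counter() - t0
    return i, dt / steps       # seconds per dependent access

if __name__ == "__main__":
    chain = make_chain(1 << 16)
    _, per_access = chase(chain, 1_000_000)
    print(f"{per_access * 1e9:.1f} ns per dependent access")
```

[On real hardware (in C, with the chain sized to overflow each cache level in turn), the per-access time steps up at each level of the hierarchy, which is exactly the latency curve the hardware folks are after.]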
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 16:47:02 2024

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally, hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths, and the efficacy of cache coherency algorithms and
cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
that's why we have PMCs not just for memory accesses, but also for
branch prediction accuracy, functional unit utilization, scheduler
utilization, etc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 26 17:29:00 2024

    In article <2024Mar26.174702@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    There can be considerable confusion on this point. In the early days of
    Intel VTune, it would only work on small and simple programs, but Intel
    sent one of the lead developers to visit the UK with it, expecting that
    it would instantly find huge speed-ups in my employers' code.

    What happened was that VTune crashed almost instantly when faced with
    something that large, and Intel learned about the difference between microarchitecture analysis and application analysis.

    John
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Mar 26 18:47:38 2024

    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally, hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths, and the efficacy of cache coherency algorithms and
cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
that's why we have PMCs not just for memory accesses, but also for
branch prediction accuracy, functional unit utilization, scheduler
utilization, etc.

    Quit being so CPU-centric.

You also need measurements of how many of which transactions flew across
the bus, DRAM usage analysis, and PCIe usage to fully tune the system.
