• Re: Efficiency of in-order vs. OoO

    From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 18:23:50 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:

    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

My 66000 Architecture defines 8 performance counters at each layer of
the design: each core gets 8 counters, each L1 gets 8, each L3 gets 8,
the Interconnect gets 8, each Memory Controller gets 8, and each PCIe
root gets 8--and every instance multiplies the counters.

    It's not really the number of counters that is important, rather
    it is what the counters count (i.e. which events can be counted).
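[As an illustrative aside: the statistical profiling mentioned above boils down to periodic sampling: snapshot where execution is at a fixed period, and let the histogram of samples approximate where the time goes. A toy sketch, with all function names and cycle counts invented:]

```python
# Toy illustration of statistical profiling: the "program" is a list of
# (function, cycles) intervals run back-to-back, and we sample it at a
# fixed period. Real SPE/PMU sampling works analogously, in hardware.

from collections import Counter

def sample_profile(intervals, period):
    """intervals: list of (name, cycles) executed back-to-back.
    Returns a Counter of samples per name, taken every `period` cycles."""
    samples = Counter()
    t = 0        # next sample time, in cycles
    clock = 0    # current time, in cycles
    for name, cycles in intervals:
        end = clock + cycles
        while t < end:
            samples[name] += 1
            t += period
        clock = end
    return samples

if __name__ == "__main__":
    run = [("init", 100), ("hot_loop", 10_000), ("cleanup", 400)]
    print(sample_profile(run, period=250))   # hot_loop dominates the samples
```

[The point being that the sample counts approximate the time distribution without perturbing the program, which is why the overhead stays near zero.]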

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Mar 25 18:35:35 2024

    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 25 14:33:44 2024

    On 3/25/2024 1:35 PM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.


    Odd...

I had mostly skipped performance counters in hardware, and was
instead using an emulator to model performance (among other things), but
for performance tuning this only works insofar as the emulator remains
accurate in terms of cycle costs (I make an effort, but accuracy seems to vary).


    One annoyance is that trying to model some newer or more advanced
    features may bog down the emulator enough that it can't maintain
    real-time performance.

Though, I guess it is likely that for a "not painfully slow" processor
(like an A55 or similar) cycle-accurate emulation in real-time at the
native clock speed may not be viable (one would burn way too many cycles
trying to model things like the cache hierarchy and branch predictor, ...).
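[To illustrate why that modeling is expensive: even a minimal cache model has to be consulted on every memory access the emulator executes. A toy direct-mapped sketch, with sizes and cycle costs invented for illustration:]

```python
# Minimal direct-mapped cache model of the kind an emulator might consult
# on every memory access; all parameters here are made-up examples.

class DirectMappedCache:
    def __init__(self, num_lines=256, line_size=64, hit_cost=1, miss_cost=40):
        self.num_lines = num_lines
        self.line_size = line_size
        self.hit_cost = hit_cost      # cycles charged on a hit
        self.miss_cost = miss_cost    # cycles charged on a miss (next-level fill)
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return the cycle cost of one access, updating hit/miss stats."""
        line_addr = addr // self.line_size
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return self.hit_cost
        self.tags[index] = tag        # allocate on miss
        self.misses += 1
        return self.miss_cost

if __name__ == "__main__":
    cache = DirectMappedCache()
    # Sequential sweep: one miss per 64-byte line, then hits within it.
    cycles = sum(cache.access(a) for a in range(0, 4096, 8))
    print(cache.hits, cache.misses, cycles)
```

[Even this trivial model adds an indexed lookup and a compare per emulated access; a multi-level hierarchy plus a branch predictor multiplies that overhead.]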


Some amount of debugging and performance measurement is possible via
"LED" outputs, which show the status of pipeline stalls and the signals
that feed into these stalls (and, indirectly, the percentage of time
spent running instructions, via the absence of stalls, ...), ...

I had generated a cycle-use ranking for the full simulation by having the
testbench code run checks mostly on these LED outputs (rather than looking
at the internal signals directly).
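[The kind of tally such a testbench can build from sampled stall signals might look like the following sketch; the stall-cause names here are hypothetical:]

```python
# Hypothetical sketch: tally per-cycle stall signals (as a testbench
# watching "LED" outputs might) and report what fraction of cycles did
# useful work, plus how often each stall cause was asserted.

def stall_breakdown(samples):
    """samples: iterable of per-cycle sets naming the stall causes
    asserted that cycle (empty set = no stall, instruction progressed)."""
    counts = {}
    busy = 0
    total = 0
    for causes in samples:
        total += 1
        if not causes:
            busy += 1
        for c in causes:
            counts[c] = counts.get(c, 0) + 1
    return busy / total, counts

if __name__ == "__main__":
    trace = [set(), {"dcache"}, {"dcache"}, set(),
             {"branch"}, set(), set(), {"dcache", "branch"}]
    util, causes = stall_breakdown(trace)
    print(f"useful cycles: {util:.0%}", causes)
```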

    Runs on an actual FPGA are admittedly comparably infrequent.


Though, ironically, I have noted that things like shell commands, etc., can
still be fairly responsive even for Verilog simulations effectively
running in kHz territory (whereas good responsiveness is sometimes a
struggle even for modern PCs running Windows).

Or, take a tool I had recently been working on: due to some combination
of factors, at one point in the testing, process creation kept taking
around 20 seconds each time, which was rather annoying (because
seemingly Windows would just upload the whole binary to the internet,
then wait for a response, before letting it run).

Seemingly, something about the tool was triggering "Windows Defender
SmartScreen" or similar; it never gave any warnings/messages
about it, merely caused a fairly long/annoying delay whenever
the tool was relaunched. Then it just magically went away (after one of my
secondary UPSs had "let the smoke out" and the ISP had also partly gone
down for a while; I could see the ISP's local IPs, but access to the wider
internet was seemingly disrupted, ... Like, seemingly, a "the ghosts in
the machine are not happy right now" type of event).

The tool itself was mostly meant to be something sorta like SFTP, but
instead for working with disk images. I am starting to want to revisit the
filesystem question, but looking back at NTFS, I still don't really want
to try to implement an NTFS driver.

Possibly EXT2/3/4 would be an option, apart from the annoyance that
Windows can't access it, so I would still be little better off than just
rolling my own (while trying to keep the core design hopefully not
needlessly complicated).

    ...



  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Mar 25 20:22:00 2024

    In article <2024Mar25.193535@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

The question is whether "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

    John
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Mar 25 21:42:18 2024

    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

Having reverse engineered the original Pentium EMON counters, I got a
meeting with Intel about their next CPU (the Pentium Pro). What I was
told about the Pentium was that it was the first chip too complicated
to create/sell an In-Circuit Emulator (ICE) version of, so instead they
added a bunch of counters for near-zero-overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:46:39 2024

    jgd@cix.co.uk (John Dallman) writes:
In article <2024Mar25.193535@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

The question is whether "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

Look at VTune, for example.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:48:08 2024

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late '70s/early '80s) is JTAG.
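[For illustration: the bit-serial readout works like one big shift register. The chip's flip-flops are stitched into a chain, and state is clocked out one bit per clock while new state is clocked in. A toy model of just the shifting, not any real TAP state machine:]

```python
# Toy model of a scan chain: internal flip-flop state is stitched into one
# shift register and clocked out (while new state clocks in) bit-serially,
# the way JTAG dumps internal state for debug. Purely illustrative.

def scan_out(chain, scan_in_bits):
    """Shift the chain out one bit per clock while shifting new bits in.
    Returns (captured_bits, new_chain_state)."""
    chain = list(chain)
    out = []
    for bit in scan_in_bits:
        out.append(chain[-1])          # last flop in the chain drives TDO
        chain = [bit] + chain[:-1]     # everything shifts one stage toward TDO
    return out, chain

if __name__ == "__main__":
    state = [1, 0, 1, 1]               # captured internal state
    dumped, new_state = scan_out(state, [0, 0, 0, 0])
    print(dumped, new_state)
```

[A real chain has thousands to millions of stages, which is why the readout is slow but costs almost no extra silicon.]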
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:22:31 2024

    jgd@cix.co.uk (John Dallman) writes:
The question is whether "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
most of the latter to want those features so that they can understand the
performance of their silicon better.

    That might explain why for the AmLogic S922X in the Odroid N2/N2+
    there is a Linux 4.9 kernel that supports performance monitoring
    counters (AmLogic put that in for their own uses), but the mainline
    Linux kernel does not support perf on the S922X (perf was not in the requirements of whoever integrated the S922X stuff into the mainline).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:27:54 2024

    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Look at vtune, for example.

    And?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Mar 26 10:47:07 2024

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

Scan chains. The modern interface to scan chains (which we used on the
mainframes in the late '70s/early '80s) is JTAG.


    Thanks!

JTAG was indeed the term I was looking for (and not remembering). Maybe
I'm getting old?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Mar 26 14:15:41 2024

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally, hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths, and the efficacy of cache coherency algorithms and
cache prefetchers.

    Their target is not application analysis.
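[One classic way such latencies are probed, in emulators and on silicon alike, is a dependent pointer chase: each access depends on the previous one, so accesses cannot overlap. A sketch of the structure (in Python the measured time reflects interpreter overhead, not memory latency, so treat this as illustrating the technique only):]

```python
# Sketch of the dependent pointer chase used to probe load-to-use latency:
# a random cyclic permutation is walked so every access depends on the
# result of the one before it, defeating overlap and prefetching.

import random
import time

def make_chain(n):
    """Build a random cyclic permutation: chain[i] gives the next index."""
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        chain[a] = b
    return chain

def chase(chain, steps):
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = chain[i]           # each access depends on the previous result
    dt = time.perf_counter() - t0
    return i, dt / steps       # seconds per dependent access

if __name__ == "__main__":
    chain = make_chain(1 << 16)
    _, per_access = chase(chain, 1_000_000)
    print(f"{per_access * 1e9:.1f} ns per dependent access")
```

[On real hardware (in C, with the chain sized to overflow each cache level in turn), the per-access time steps up at each level of the hierarchy, which is exactly the latency curve the hardware folks are after.]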
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 16:47:02 2024

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally, hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths, and the efficacy of cache coherency algorithms and
cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
that's why we have PMCs not just for memory accesses, but also for
branch prediction accuracy, functional unit utilization, scheduler
utilization, etc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 26 17:29:00 2024

    In article <2024Mar26.174702@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    There can be considerable confusion on this point. In the early days of
    Intel VTune, it would only work on small and simple programs, but Intel
    sent one of the lead developers to visit the UK with it, expecting that
    it would instantly find huge speed-ups in my employers' code.

    What happened was that VTune crashed almost instantly when faced with
    something that large, and Intel learned about the difference between microarchitecture analysis and application analysis.

    John
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Mar 26 18:47:38 2024

    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.

You don't want to use a full-blown microarchitectural emulator for a
long-running program.

Generally, hardware folks don't run 'long-running programs' when
analyzing performance; they use the emulator for determining latencies,
bandwidths, and the efficacy of cache coherency algorithms and
cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
that's why we have PMCs not just for memory accesses, but also for
branch prediction accuracy, functional unit utilization, scheduler
utilization, etc.

    Quit being so CPU-centric.

You also need measurements of how many of which transactions flew across
the bus, DRAM usage analysis, and PCIe usage to fully tune the system.
