• Re: 88xxx or PPC

    From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 15:15:04 2024
    From Newsgroup: comp.arch

    On 3/8/24 11:17 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:

    [snip]
    Register windows were intended to avoid save/restore overhead by
    retaining values in registers with renaming. A stack cache is
    meant to reduce the overhead of loads and stores to the stack —
    not just preserving and restoring registers. A direct-mapped stack
    cache is not entirely insane. A partial stack frame cache might
    cache up to 256 bytes (e.g.) per frame, with alternating frames
    indexed with inverted bits (to reduce interference); one could
    even reserve a chunk (e.g., 64 bytes) of a frame that is not
    overlapped by the adjacent frame, by limiting the cached offsets
    to be smaller than the cache.
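
    To make the indexing idea concrete, here is a rough C sketch;
    the sizes and the depth-based inversion rule are merely
    illustrative, not a worked-out design:

        #include <stdint.h>
        #include <stdio.h>

        #define SC_SIZE 256u            /* bytes of stack cached per frame */
        #define SC_MASK (SC_SIZE - 1u)

        /* Illustrative index function: the low bits of the offset within
           the frame select the entry, and frames at odd call depth have
           those bits inverted so that a caller's frame and its callee's
           frame tend to land at opposite ends of the small cache rather
           than colliding.  Only offsets below SC_SIZE are cached at all. */
        static uint32_t stack_cache_index(uint32_t frame_offset,
                                          uint32_t call_depth)
        {
            uint32_t idx = frame_offset & SC_MASK;
            if (call_depth & 1u)
                idx ^= SC_MASK;     /* invert bits for alternating frames */
            return idx;
        }

        int main(void)
        {
            /* offset 0 of a frame and offset 0 of its callee map far apart */
            printf("%u %u\n", stack_cache_index(0, 4), stack_cache_index(0, 5));
            return 0;
        }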

    Such might be more useful than register windows, but that does
    not mean that it is actually a good option.

    If it is such a good option why has it not reached production ??

    "Might be more useful than register windows" is not the same as
    providing a net benefit when considering the entire system.

    One obvious issue with a small stack cache is utilization. While
    generic data caches also have utilization issues (no single size
    is ideal for all workloads) and the stack cache would be small
    (and potentially highly prefetchable), the spilling and filling
    overhead at entering and exiting stack frames could be much
    greater than the savings from simple addressing (and permission
    checks) if few accesses are made within the cached part of the
    stack frame between frame spills and fills.
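
    A trivial way to see the break-even point (all of the numbers
    below are made up, not measurements):

        #include <stdio.h>

        /* Illustrative only -- none of these numbers come from the post:
           if spilling plus refilling the cached part of a frame costs
           roughly (spill + fill) cycles of occupancy and each hit saves
           about 'save' cycles versus the L1, a frame needs at least
           (spill + fill) / save cached accesses before the cache wins. */
        int main(void)
        {
            double spill = 8.0, fill = 8.0;  /* assumed per-frame overhead */
            double save  = 1.0;              /* assumed saving per access  */
            printf("break-even accesses per frame: %.0f\n",
                   (spill + fill) / save);
            return 0;
        }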

    A latency-optimized partial frame stack cache would also benefit
    from tuning to the specific sizes of the higher-utilization
    regions of stack frames with longish frame-active periods, so
    compiler-based optimization would be a factor. Depending on
    microarchitecture-specific compiler optimization for good
    performance is generally avoided.
    This is related to the software distribution format. If aliasing
    were not avoided by architectural contract (which would be
    difficult for any existing ISA), then handling aliases would also
    introduce overhead. (For higher utilization, one might want to
    avoid caching
    the registers saved at function entry, assuming these are colder
    and less latency sensitive than other values in the frame. Since
    the amount of the frame used by saved registers would vary, a
    hardware-friendly fixed uncached chunk would either waste capacity
    on cold saved registers when more registers are saved or make some
    potentially warmer values uncached [in the stack cache]. Updating
    the stack pointer to hide the saved registers would address this
    but would presumably introduce other issues.)

    Another factor that would reduce the attractiveness of specialized
    caches is the use of out-of-order execution. OoOE helps hide
    latency, so any latency benefit is less important.

    Not all optimization opportunities are implemented even when they
    do not conflict excessively. Part of this is the complexity and
    risks of adding new features.

    On 3/6/24 3:00 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    An L2 register set that can only be accessed for one operand
    might be somewhat similar to LD-OP.

    In high speed designs, there are at least 2 cycles of delay from
    AGEN to the L2 and 2 cycles of delay back. Even zero cycle access
    sees at least 4 cycles of latency, 5 if you count AGEN.

    There seems to have been confusion. I wrote "L2 _register_ set".
    Being able to access a larger register name space for one operand
    might be useful when value reuse often has moderate temporal
    locality.
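
    As an invented illustration of what "one operand" could mean at
    the encoding level, assuming 32 ordinary registers and a
    256-entry L2 register set (the field layout is not from any real
    ISA):

        #include <stdint.h>
        #include <stdio.h>

        /* Invented encoding sketch: a three-operand instruction in which
           only the second source may name one of 256 "L2" registers,
           while the destination and first source are limited to the 32
           ordinary (L1) registers. */
        typedef struct {
            uint32_t opcode     : 8;
            uint32_t rd         : 5;   /* L1 register */
            uint32_t rs1        : 5;   /* L1 register */
            uint32_t src2_is_l2 : 1;   /* selects the name space for rs2 */
            uint32_t rs2        : 8;   /* L1 (5 bits used) or L2 (all 8) */
            uint32_t unused     : 5;
        } insn_sketch;

        int main(void)
        {
            insn_sketch add = { .opcode = 0x20, .rd = 3, .rs1 = 4,
                                .src2_is_l2 = 1, .rs2 = 200 };
            printf("src2 comes from %s register %u\n",
                   add.src2_is_l2 ? "L2" : "L1", (unsigned)add.rs2);
            return 0;
        }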

    Such an L2 register set is even more complicated than load-op in
    terms of compiler optimization.

    Renaming a larger name space of (L2) registers would also
    introduce issues. I suspect something more like a Load-Store Queue
    would be used rather than a register alias table. The benefits
    from specialization (e.g., smaller LSQ tags, because the L2
    register name space is smaller than the general memory address
    space) would conflict with the utilization benefits of having
    only an LSQ.

    Physical placement would also involve tradeoffs of latency (and
    access energy) relative to L1 data cache. Giving prime real estate
    to an L2 register file would increase L1 latency (and access
    energy).

    Dynamic scheduling would also be made a little more complicated
    by the addition of another latency consideration, and using
    banking rather than multiporting (which becomes more reasonable
    at larger capacities) would add more latency variability.

    It does seem *to me* that there should be a benefit from a storage
    region of intermediate capacity with simpler addressing than
    general memory.

    Presumably this is related to the storage technology used as
    well as the capacity.

    Purely wire delay due to the size of the L2 cache.

    Wire delay due to physical size is related to storage technology
    as well as capacity. E.g., DRAM can be denser than SRAM and thus
    lower latency at larger sizes even when array access is slower.
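
    A toy latency model (all parameters invented) showing how a
    denser but slower array can win once wire delay dominates:

        #include <math.h>
        #include <stdio.h>

        /* Toy model, numbers invented for illustration: total latency is
           a fixed array-access time plus a wire-delay term that grows
           with the linear dimension of the macro, i.e. with
           sqrt(capacity / density).  The denser technology can end up
           faster at large capacities despite a slower array. */
        static double latency_ns(double kib, double density_kib_per_mm2,
                                 double array_ns, double wire_ns_per_mm)
        {
            double side_mm = sqrt(kib / density_kib_per_mm2);
            return array_ns + wire_ns_per_mm * side_mm;
        }

        int main(void)
        {
            for (double kib = 64; kib <= 16384; kib *= 4)
                printf("%6.0f KiB  sparse %.2f ns  dense %.2f ns\n", kib,
                       latency_ns(kib, 100.0, 0.5, 0.5),
                       latency_ns(kib, 300.0, 1.0, 0.5));
            return 0;
        }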

    Single-ported register storage technology would (I ass_me) be even
    less dense than SRAM, such that there would be some capacity where
    latency would be better with SRAM even when register storage would
    be faster at the array level. Of course, latency is not the only
    consideration for storage.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 18:10:27 2024
    From Newsgroup: comp.arch

    On 3/8/24 11:14 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    [snip interesting physical design details of SRAM/register storage]

    As noted later, memory accesses can also be indexed by a fixed bit
    pattern in the instruction. Determining whether a register ID bit
    field is actually used may well require less decoding than
    determining if an operation is a load based on stack pointer or
    global pointer with an immediate offset, but the difference would
    not seem to be that great. The offset size would probably also
    have to be checked — the special cache would be unlikely to
    support all offsets.
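
    For concreteness, the predecode test might look roughly like the
    following; RV32I load encodings are used only as a familiar
    stand-in (the post does not assume a particular ISA), and the
    256-byte offset bound for the special cache is an assumed value:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical predecode test: RV32I loads have the base register
           in bits 19:15, a sign-extended 12-bit offset in bits 31:20, and
           opcode 0000011.  Both the base register and the offset size
           must be checked before the access can be steered to the special
           cache. */
        #define REG_SP 2u
        #define REG_GP 3u
        #define SPECIAL_CACHE_BYTES 256

        static bool predecode_hits_special_cache(uint32_t insn)
        {
            if ((insn & 0x7fu) != 0x03u)          /* not a load */
                return false;
            uint32_t base = (insn >> 15) & 0x1fu;
            if (base != REG_SP && base != REG_GP)
                return false;
            int32_t off = (int32_t)insn >> 20;    /* arithmetic shift assumed */
            return off >= 0 && off < SPECIAL_CACHE_BYTES;
        }

        int main(void)
        {
            /* lw x5, 16(x2): offset 16 from the stack pointer */
            uint32_t lw_sp_16 = (16u << 20) | (REG_SP << 15) | (2u << 12)
                              | (5u << 7) | 0x03u;
            printf("%d\n", predecode_hits_special_cache(lw_sp_16));
            return 0;
        }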

    Predecoding on insertion into the instruction cache could cache
    this usage information.

    You cannot predecode if the instruction is not of fixed size, (or
    if you do not add predecode bits ala Athlon, Opteron).

    One can have variable length instructions and predecoding on fill
    if one uses instruction bundles.

    Heidi Pan's "Heads and Tails" ("High Performance, Variable-Length
    Instruction Encodings", Master's Thesis, 2002) uses fixed length
    instruction components ("heads") filling from one end of the
    bundle toward the middle and variable length components ("tails")
    filling from the other end. This design intentionally disallowed
    instructions crossing a bundle boundary and was primarily intended
    for code density with parallel decode.
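
    A toy parser for such a bundle; the bundle width, count field,
    head size, and tail-length rule below are invented rather than
    taken from the thesis -- only the heads-from-one-end,
    tails-from-the-other structure is the point:

        #include <stdint.h>
        #include <stdio.h>

        /* Toy "Heads and Tails"-style bundle: a 128-bit bundle holds a
           4-bit instruction count, fixed 16-bit heads packed from the low
           end, and variable-length tails packed from the high end; no
           instruction crosses a bundle boundary.  The low 2 bits of each
           head give its tail length in bytes (an invented rule). */
        #define BUNDLE_BITS 128
        #define HEAD_BITS   16

        typedef struct { uint16_t head; uint32_t tail; } ht_insn;

        static unsigned get_bits(const uint8_t *b, unsigned pos, unsigned n)
        {
            unsigned v = 0;
            for (unsigned i = 0; i < n; i++)
                v |= ((b[(pos + i) / 8] >> ((pos + i) % 8)) & 1u) << i;
            return v;
        }

        static unsigned parse_bundle(const uint8_t *b, ht_insn out[8])
        {
            unsigned count    = get_bits(b, 0, 4);
            unsigned head_pos = 4;                /* heads grow upward   */
            unsigned tail_pos = BUNDLE_BITS;      /* tails grow downward */
            for (unsigned i = 0; i < count; i++) {
                out[i].head = (uint16_t)get_bits(b, head_pos, HEAD_BITS);
                head_pos += HEAD_BITS;
                unsigned tlen = (out[i].head & 3u) * 8u;
                tail_pos -= tlen;
                out[i].tail = tlen ? get_bits(b, tail_pos, tlen) : 0;
            }
            return count;
        }

        int main(void)
        {
            uint8_t bundle[BUNDLE_BITS / 8] = {0};
            ht_insn d[8];
            bundle[0]  = 0x21;   /* count = 1; head bit 1 set         */
            bundle[2]  = 0x04;   /* head bit 14 set -> head = 0x4002  */
            bundle[14] = 0xEF;   /* 16-bit tail 0xBEEF at the far end */
            bundle[15] = 0xBE;
            unsigned n = parse_bundle(bundle, d);
            printf("%u insn(s): head %#x tail %#x\n",
                   n, (unsigned)d[0].head, (unsigned)d[0].tail);
            return 0;
        }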

    A more complex arrangement of bits than in "Heads and Tails" with
    support for splitting immediate bits across bundle boundaries
    could remove some of the code density penalty of "Heads and Tails"
    while still supporting predecode on fill. The bundling only needs
    to provide the ability to parse the bundle into instructions with
    reasonable parallelism, and for some uses failing to special-case
    some operations via predecode would not be problematic; those
    instances might "merely" be unoptimized on the first execution
    (after final decode on the first fetch, the predecoded form could
    be updated, at some complexity cost).

    The borrowing aspect seems to require some additional information,
    perhaps a pseudo-instruction that joins an immediate field with
    the immediate part from the previous bundle. This would reduce
    code density. In a "Heads and Tails"-like scheme, unused bits in
    the middle might be automatically appended to the first immediate
    in the next instruction.

    (I seem to recall that there was an ISA that sacrificed half the
    opcode space to provide variable-sized immediates. The first bit
    of a parcel indicated whether it was an immediate to be patched
    together or an operation and register operands. Such an encoding
    is similar to the x86 instruction boundary marker bits.)
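
    Something along these lines, with the parcel width and bit
    assignment guessed to make the idea concrete:

        #include <stdint.h>
        #include <stdio.h>

        /* Guessed encoding for the recalled scheme: each 16-bit parcel
           whose top bit is set contributes its low 15 bits to an
           immediate (most significant parcel first); a clear top bit
           means the parcel holds an operation with register operands,
           which ends the immediate. */
        static uint64_t gather_immediate(const uint16_t *p, unsigned *i)
        {
            uint64_t imm = 0;
            while (p[*i] & 0x8000u) {
                imm = (imm << 15) | (p[*i] & 0x7fffu);
                (*i)++;
            }
            return imm;
        }

        int main(void)
        {
            uint16_t parcels[] = { 0x8001, 0x8234, 0x1abc }; /* imm, imm, op */
            unsigned i = 0;
            uint64_t imm = gather_immediate(parcels, &i);
            printf("immediate = 0x%llx, next op at parcel %u\n",
                   (unsigned long long)imm, i);
            return 0;
        }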

    Even with My 66000's variable length instructions, most (by
    frequency of occurrence) 32-bit immediates would be illegal
    instructions and more significant 32-bit words in 64-bit
    immediates would usually be illegal instructions, so one could
    probably have highly accurate speculative predecode-on-fill.

    If branch prediction fetch ahead used instruction addresses
    (rather than cache block addresses), a valid target prediction
    would provide accurate predecode for the following instructions
    and constrain the possible decodings for preceding instructions.

    A predecode mistake that takes an immediate 32-bit word for an
    opcode-containing word might not be particularly painful.
    Mistakenly "finding" a branch in predecode might not be that
    painful even if predicted taken — similar to a false BTB hit
    corrected in decode. Wrongly "finding" an optimizable load
    instruction might waste resources and introduce a minor glitch in
    decode (where the "instruction" has to be retranslated into an
    immediate component).

    It *feels* attractive to me to have predecode fill a BTB-like
    structure to reduce redundant data storage. Filling the "BTB" with
    less critical instruction data when there are few (immediate-
    based) branches seems less hurtful than losing some taken branch
    targets, though a parallel ordinary BTB (redundant storage) might
    compensate. The BTB-like structure might hold more diverse
    information that could benefit from early availability; e.g.,
    loads from something like a "Knapsack Cache". (Even loads from a
    more variable base might be sped by having a future file of two or
    three such base addresses — or even just the least significant
    bits — which could be accessed more quickly and earlier than the
    general register file. Bases that are changed frequently with
    dynamic values [not immediate addition] would rarely update the
    future file fast enough to be useful. I think some x86
    implementations did something similar by adding segment base and
    displacement early in the pipeline.) More generally, it seems that
    the instruction stream could be parsed and stored into components
    with different tradeoffs in latency, capacity, etc.
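
    A sketch of what one entry of such a structure might carry (the
    fields are invented for illustration):

        #include <stdint.h>
        #include <stdio.h>

        /* Invented entry format for a predecode-filled, BTB-like
           structure: each entry records either a taken-branch target
           (the usual BTB payload) or early information about an
           optimizable load, tagged by the fetch address it decorates. */
        typedef enum { ENTRY_BRANCH, ENTRY_LOAD } entry_kind;

        typedef struct {
            uint64_t   fetch_tag;            /* instruction address    */
            entry_kind kind;
            union {
                uint64_t branch_target;      /* predicted taken target */
                struct {
                    uint8_t base_reg;        /* e.g. SP or GP          */
                    int32_t offset;          /* small immediate offset */
                } load;
            } u;
        } predecode_entry;

        int main(void)
        {
            predecode_entry e = { .fetch_tag = 0x1000, .kind = ENTRY_LOAD,
                                  .u.load = { .base_reg = 2, .offset = 16 } };
            if (e.kind == ENTRY_LOAD)
                printf("load via r%u + %d\n",
                       (unsigned)e.u.load.base_reg, e.u.load.offset);
            return 0;
        }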

    I do not know whether such "aggressive" predecode would be
    worthwhile, nor what in-memory format would best manage the
    tradeoffs of density, parallelism, criticality, etc., or what
    "L1 cache" format would be best (with added
    specialization/utilization tradeoffs).
    --- Synchronet 3.20a-Linux NewsLink 1.114