• Re: 88xxx or PPC

    From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 15:15:04 2024
    From Newsgroup: comp.arch

    On 3/8/24 11:17 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:

    [snip]
    Register windows were intended to avoid save/restore overhead by
    retaining values in registers with renaming. A stack cache is
    meant to reduce the overhead of loads and stores to the stack —
    not just preserving and restoring registers. A direct-mapped stack
    cache is not entirely insane. A partial stack frame cache might
    cache up to 256 bytes (e.g.) per frame, with alternating frames
    indexed with inverted bits (to reduce interference); one could
    even reserve a chunk (e.g., 64 bytes) of a frame that is not
    overlapped by the adjacent frame, by limiting the cached offsets
    to be smaller than the cache.
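
    To make the indexing idea concrete, here is a rough C sketch;
    the sizes and the depth-based inversion rule are merely
    illustrative, not a worked-out design:

        #include <stdint.h>
        #include <stdio.h>

        #define SC_SIZE 256u            /* bytes of stack cached per frame */
        #define SC_MASK (SC_SIZE - 1u)

        /* Illustrative index function: the low bits of the offset within
           the frame select the entry, and frames at odd call depth have
           those bits inverted so that a caller's frame and its callee's
           frame tend to land at opposite ends of the small cache rather
           than colliding.  Only offsets below SC_SIZE are cached at all. */
        static uint32_t stack_cache_index(uint32_t frame_offset,
                                          uint32_t call_depth)
        {
            uint32_t idx = frame_offset & SC_MASK;
            if (call_depth & 1u)
                idx ^= SC_MASK;     /* invert bits for alternating frames */
            return idx;
        }

        int main(void)
        {
            /* offset 0 of a frame and offset 0 of its callee map far apart */
            printf("%u %u\n", stack_cache_index(0, 4), stack_cache_index(0, 5));
            return 0;
        }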

    Such might be more useful than register windows, but that does
    not mean that it is actually a good option.

    If it is such a good option why has it not reached production ??

    "Might be more useful than register windows" is not the same as
    providing a net benefit when considering the entire system.

    One obvious issue with a small stack cache is utilization. While
    generic data caches also have utilization issues (no single size
    is ideal for all workloads) and the stack cache would be small
    (and potentially highly prefetchable), the spilling and filling
    overhead at entering and exiting stack frames could be much
    greater than the savings from simple addressing (and permission
    checks) if few accesses are made within the cached part of the
    stack frame between frame spills and fills.
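
    A trivial way to see the break-even point (all of the numbers
    below are made up, not measurements):

        #include <stdio.h>

        /* Illustrative only -- none of these numbers come from the post:
           if spilling plus refilling the cached part of a frame costs
           roughly (spill + fill) cycles of occupancy and each hit saves
           about 'save' cycles versus the L1, a frame needs at least
           (spill + fill) / save cached accesses before the cache wins. */
        int main(void)
        {
            double spill = 8.0, fill = 8.0;  /* assumed per-frame overhead */
            double save  = 1.0;              /* assumed saving per access  */
            printf("break-even accesses per frame: %.0f\n",
                   (spill + fill) / save);
            return 0;
        }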

    A latency-optimized partial frame stack cache would also benefit
    from tuning to the specific sizes of the higher-utilization
    regions of stack frames with longish frame-active periods, so
    compiler-based optimization would be a factor. Depending on
    microarchitecture-specific compiler optimization for good
    performance is generally avoided.
    This is related to the software distribution format. If aliasing
    were not avoided by architectural contract (which would be
    difficult for any existing ISA), then handling aliases would also
    introduce overhead. (For higher utilization, one might want to
    avoid caching
    the registers saved at function entry, assuming these are colder
    and less latency sensitive than other values in the frame. Since
    the amount of the frame used by saved registers would vary, a
    hardware-friendly fixed uncached chunk would either waste capacity
    on cold saved registers when more registers are saved or make some
    potentially warmer values uncached [in the stack cache]. Updating
    the stack pointer to hide the saved registers would address this
    but would presumably introduce other issues.)

    Another factor that would reduce the attractiveness of specialized
    caches is the use of out-of-order execution. OoOE helps hide
    latency, so any latency benefit is less important.

    Not all optimization opportunities are implemented even when they
    do not conflict excessively. Part of this is the complexity and
    risks of adding new features.

    On 3/6/24 3:00 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    An L2 register set that can only be accessed for one operand
    might be somewhat similar to LD-OP.

    In high speed designs, there are at least 2 cycles of delay from
    AGEN to the L2 and 2 cycles of delay back. Even zero cycle access
    sees at least 4 cycles of latency, 5 if you count AGEN.

    There seems to have been confusion. I wrote "L2 _register_ set".
    Being able to access a larger register name space for one operand
    might be useful when value reuse often has moderate temporal
    locality.
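
    As an invented illustration of what "one operand" could mean at
    the encoding level, assuming 32 ordinary registers and a
    256-entry L2 register set (the field layout is not from any real
    ISA):

        #include <stdint.h>
        #include <stdio.h>

        /* Invented encoding sketch: a three-operand instruction in which
           only the second source may name one of 256 "L2" registers,
           while the destination and first source are limited to the 32
           ordinary (L1) registers. */
        typedef struct {
            uint32_t opcode     : 8;
            uint32_t rd         : 5;   /* L1 register */
            uint32_t rs1        : 5;   /* L1 register */
            uint32_t src2_is_l2 : 1;   /* selects the name space for rs2 */
            uint32_t rs2        : 8;   /* L1 (5 bits used) or L2 (all 8) */
            uint32_t unused     : 5;
        } insn_sketch;

        int main(void)
        {
            insn_sketch add = { .opcode = 0x20, .rd = 3, .rs1 = 4,
                                .src2_is_l2 = 1, .rs2 = 200 };
            printf("src2 comes from %s register %u\n",
                   add.src2_is_l2 ? "L2" : "L1", (unsigned)add.rs2);
            return 0;
        }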

    Such an L2 register set is even more complicated than load-op in
    terms of compiler optimization.

    Renaming a larger name space of (L2) registers would also
    introduce issues. I suspect something more like a Load-Store Queue
    would be used rather than a register alias table. The benefits
    from specialization (e.g., smaller LSQ tags, because the L2
    register name space is smaller than the general memory address
    space) would conflict with the utilization benefits of having
    only an LSQ.

    Physical placement would also involve tradeoffs of latency (and
    access energy) relative to L1 data cache. Giving prime real estate
    to an L2 register file would increase L1 latency (and access
    energy).

    Dynamic scheduling would also be made a little more complicated
    by the addition of another latency consideration, and using
    banking rather than multiporting (which becomes more reasonable
    at larger capacities) would add more latency variability.

    It does seem *to me* that there should be a benefit from a storage
    region of intermediate capacity with simpler addressing than
    general memory.

    Presumably this is related to the storage technology used as
    well as the capacity.

    Purely wire delay due to the size of the L2 cache.

    Wire delay due to physical size is related to storage technology
    as well as capacity. E.g., DRAM can be denser than SRAM and thus
    lower latency at larger sizes even when array access is slower.
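
    A toy latency model (all parameters invented) showing how a
    denser but slower array can win once wire delay dominates:

        #include <math.h>
        #include <stdio.h>

        /* Toy model, numbers invented for illustration: total latency is
           a fixed array-access time plus a wire-delay term that grows
           with the linear dimension of the macro, i.e. with
           sqrt(capacity / density).  The denser technology can end up
           faster at large capacities despite a slower array. */
        static double latency_ns(double kib, double density_kib_per_mm2,
                                 double array_ns, double wire_ns_per_mm)
        {
            double side_mm = sqrt(kib / density_kib_per_mm2);
            return array_ns + wire_ns_per_mm * side_mm;
        }

        int main(void)
        {
            for (double kib = 64; kib <= 16384; kib *= 4)
                printf("%6.0f KiB  sparse %.2f ns  dense %.2f ns\n", kib,
                       latency_ns(kib, 100.0, 0.5, 0.5),
                       latency_ns(kib, 300.0, 1.0, 0.5));
            return 0;
        }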

    Single-ported register storage technology would (I ass_me) be even
    less dense than SRAM, such that there would be some capacity where
    latency would be better with SRAM even when register storage would
    be faster at the array level. Of course, latency is not the only
    consideration for storage.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 18:10:27 2024
    From Newsgroup: comp.arch

    On 3/8/24 11:14 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    [snip interesting physical design details of SRAM/register storage]

    As noted later, memory accesses can also be indexed by a fixed bit
    pattern in the instruction. Determining whether a register ID bit
    field is actually used may well require less decoding than
    determining if an operation is a load based on stack pointer or
    global pointer with an immediate offset, but the difference would
    not seem to be that great. The offset size would probably also
    have to be checked — the special cache would be unlikely to
    support all offsets.
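
    For concreteness, the predecode test might look roughly like the
    following; RV32I load encodings are used only as a familiar
    stand-in (the post does not assume a particular ISA), and the
    256-byte offset bound for the special cache is an assumed value:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical predecode test: RV32I loads have the base register
           in bits 19:15, a sign-extended 12-bit offset in bits 31:20, and
           opcode 0000011.  Both the base register and the offset size
           must be checked before the access can be steered to the special
           cache. */
        #define REG_SP 2u
        #define REG_GP 3u
        #define SPECIAL_CACHE_BYTES 256

        static bool predecode_hits_special_cache(uint32_t insn)
        {
            if ((insn & 0x7fu) != 0x03u)          /* not a load */
                return false;
            uint32_t base = (insn >> 15) & 0x1fu;
            if (base != REG_SP && base != REG_GP)
                return false;
            int32_t off = (int32_t)insn >> 20;    /* arithmetic shift assumed */
            return off >= 0 && off < SPECIAL_CACHE_BYTES;
        }

        int main(void)
        {
            /* lw x5, 16(x2): offset 16 from the stack pointer */
            uint32_t lw_sp_16 = (16u << 20) | (REG_SP << 15) | (2u << 12)
                              | (5u << 7) | 0x03u;
            printf("%d\n", predecode_hits_special_cache(lw_sp_16));
            return 0;
        }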

    Predecoding on insertion into the instruction cache could cache
    this usage information.

    You cannot predecode if the instruction is not of fixed size, (or
    if you do not add predecode bits ala Athlon, Opteron).

    One can have variable length instructions and predecoding on fill
    if one uses instruction bundles.

    Heidi Pan's "Heads and Tails" ("High Performance, Variable-Length
    Instruction Encodings", Master's Thesis, 2002) uses fixed length
    instruction components ("heads") filling from one end of the
    bundle toward the middle and variable length components ("tails")
    filling from the other end. This design intentionally disallowed
    instructions crossing a bundle boundary and was primarily intended
    for code density with parallel decode.
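
    A toy parser for such a bundle; the bundle width, count field,
    head size, and tail-length rule below are invented rather than
    taken from the thesis -- only the heads-from-one-end,
    tails-from-the-other structure is the point:

        #include <stdint.h>
        #include <stdio.h>

        /* Toy "Heads and Tails"-style bundle: a 128-bit bundle holds a
           4-bit instruction count, fixed 16-bit heads packed from the low
           end, and variable-length tails packed from the high end; no
           instruction crosses a bundle boundary.  The low 2 bits of each
           head give its tail length in bytes (an invented rule). */
        #define BUNDLE_BITS 128
        #define HEAD_BITS   16

        typedef struct { uint16_t head; uint32_t tail; } ht_insn;

        static unsigned get_bits(const uint8_t *b, unsigned pos, unsigned n)
        {
            unsigned v = 0;
            for (unsigned i = 0; i < n; i++)
                v |= ((b[(pos + i) / 8] >> ((pos + i) % 8)) & 1u) << i;
            return v;
        }

        static unsigned parse_bundle(const uint8_t *b, ht_insn out[8])
        {
            unsigned count    = get_bits(b, 0, 4);
            unsigned head_pos = 4;                /* heads grow upward   */
            unsigned tail_pos = BUNDLE_BITS;      /* tails grow downward */
            for (unsigned i = 0; i < count; i++) {
                out[i].head = (uint16_t)get_bits(b, head_pos, HEAD_BITS);
                head_pos += HEAD_BITS;
                unsigned tlen = (out[i].head & 3u) * 8u;
                tail_pos -= tlen;
                out[i].tail = tlen ? get_bits(b, tail_pos, tlen) : 0;
            }
            return count;
        }

        int main(void)
        {
            uint8_t bundle[BUNDLE_BITS / 8] = {0};
            ht_insn d[8];
            bundle[0]  = 0x21;   /* count = 1; head bit 1 set         */
            bundle[2]  = 0x04;   /* head bit 14 set -> head = 0x4002  */
            bundle[14] = 0xEF;   /* 16-bit tail 0xBEEF at the far end */
            bundle[15] = 0xBE;
            unsigned n = parse_bundle(bundle, d);
            printf("%u insn(s): head %#x tail %#x\n",
                   n, (unsigned)d[0].head, (unsigned)d[0].tail);
            return 0;
        }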

    A more complex arrangement of bits than in "Heads and Tails" with
    support for splitting immediate bits across bundle boundaries
    could remove some of the code density penalty of "Heads and Tails"
    while still supporting predecode on fill. The bundling only needs
    to provide the ability to parse the bundle into instructions with
    reasonable parallelism, and for some uses failing to special-case
    some operations via predecode would not be problematic; those
    instances might "merely" be unoptimized on the first execution
    (after final decode on the first fetch, the predecoded form could
    be updated, at some complexity cost).

    The borrowing aspect seems to require some additional information,
    perhaps a pseudo-instruction that joins an immediate field with
    the immediate part from the previous bundle. This would reduce
    code density. In a "Heads and Tails"-like scheme, unused bits in
    the middle might be automatically appended to the first immediate
    in the next instruction.

    (I seem to recall that there was an ISA that sacrificed half the
    opcode space to provide variable-sized immediates. The first bit
    of a parcel indicated whether it was an immediate to be patched
    together or an operation and register operands. Such an encoding
    is similar to the x86 instruction boundary marker bits.)
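
    Something along these lines, with the parcel width and bit
    assignment guessed to make the idea concrete:

        #include <stdint.h>
        #include <stdio.h>

        /* Guessed encoding for the recalled scheme: each 16-bit parcel
           whose top bit is set contributes its low 15 bits to an
           immediate (most significant parcel first); a clear top bit
           means the parcel holds an operation with register operands,
           which ends the immediate. */
        static uint64_t gather_immediate(const uint16_t *p, unsigned *i)
        {
            uint64_t imm = 0;
            while (p[*i] & 0x8000u) {
                imm = (imm << 15) | (p[*i] & 0x7fffu);
                (*i)++;
            }
            return imm;
        }

        int main(void)
        {
            uint16_t parcels[] = { 0x8001, 0x8234, 0x1abc }; /* imm, imm, op */
            unsigned i = 0;
            uint64_t imm = gather_immediate(parcels, &i);
            printf("immediate = 0x%llx, next op at parcel %u\n",
                   (unsigned long long)imm, i);
            return 0;
        }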

    Even with My 66000's variable length instructions, most (by
    frequency of occurrence) 32-bit immediates would be illegal
    instructions and more significant 32-bit words in 64-bit
    immediates would usually be illegal instructions, so one could
    probably have highly accurate speculative predecode-on-fill.

    If branch prediction fetch ahead used instruction addresses
    (rather than cache block addresses), a valid target prediction
    would provide accurate predecode for the following instructions
    and constrain the possible decodings for preceding instructions.

    A predecode mistake that takes an immediate 32-bit word for an
    opcode-containing word might not be particularly painful.
    Mistakenly "finding" a branch in predecode might not be that
    painful even if predicted taken — similar to a false BTB hit
    corrected in decode. Wrongly "finding" an optimizable load
    instruction might waste resources and introduce a minor glitch in
    decode (where the "instruction" has to be retranslated into an
    immediate component).

    It *feels* attractive to me to have predecode fill a BTB-like
    structure to reduce redundant data storage. Filling the "BTB" with
    less critical instruction data when there are few (immediate-
    based) branches seems less hurtful than losing some taken branch
    targets, though a parallel ordinary BTB (redundant storage) might
    compensate. The BTB-like structure might hold more diverse
    information that could benefit from early availability; e.g.,
    loads from something like a "Knapsack Cache". (Even loads from a
    more variable base might be sped by having a future file of two or
    three such base addresses — or even just the least significant
    bits — which could be accessed more quickly and earlier than the
    general register file. Bases that are changed frequently with
    dynamic values [not immediate addition] would rarely update the
    future file fast enough to be useful. I think some x86
    implementations did something similar by adding segment base and
    displacement early in the pipeline.) More generally, it seems that
    the instruction stream could be parsed and stored into components
    with different tradeoffs in latency, capacity, etc.
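
    A sketch of what one entry of such a structure might carry (the
    fields are invented for illustration):

        #include <stdint.h>
        #include <stdio.h>

        /* Invented entry format for a predecode-filled, BTB-like
           structure: each entry records either a taken-branch target
           (the usual BTB payload) or early information about an
           optimizable load, tagged by the fetch address it decorates. */
        typedef enum { ENTRY_BRANCH, ENTRY_LOAD } entry_kind;

        typedef struct {
            uint64_t   fetch_tag;            /* instruction address    */
            entry_kind kind;
            union {
                uint64_t branch_target;      /* predicted taken target */
                struct {
                    uint8_t base_reg;        /* e.g. SP or GP          */
                    int32_t offset;          /* small immediate offset */
                } load;
            } u;
        } predecode_entry;

        int main(void)
        {
            predecode_entry e = { .fetch_tag = 0x1000, .kind = ENTRY_LOAD,
                                  .u.load = { .base_reg = 2, .offset = 16 } };
            if (e.kind == ENTRY_LOAD)
                printf("load via r%u + %d\n",
                       (unsigned)e.u.load.base_reg, e.u.load.offset);
            return 0;
        }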

    I do not know whether such "aggressive" predecode would be
    worthwhile, nor what in-memory format would best manage the
    tradeoffs of density, parallelism, criticality, etc., or what
    "L1 cache" format would be best (with added
    specialization/utilization tradeoffs).
    --- Synchronet 3.20a-Linux NewsLink 1.114