• Concertina III Once Again

    From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 08:54:51 2026
    From Newsgroup: comp.arch

    After realizing that Mitch Alsup was right in that there was no real
    benefit in speeding up instruction decode in the manner I was trying to achieve with the use of block headers, I had tried, by going from banks of
    32 registers to banks of 16 registers, to move to variable-length instructions.
    For some reason, though, I couldn't make it work. It seemed like it
    should, but I couldn't get the 16-bit instructions to fit.
    Well, I've made another attempt. And it seems like going to banks of 16 registers is indeed sufficient (retaining, from Concertina II, the
    artifice of only using seven registers as base registers and another seven
    as index registers) to fit an instruction set as complete as the one I'm aiming for in the available opcode space.
    Of course, this does give up VLIW functionality. But while VLIW may not be
    a true failure, where it works is in small-scale embedded processors. So
    I'm not going to worry about attempting to use VLIW as a more conventional alternative to Ivan Godard's more radical Mill design.
    With sequential decode, I suppose I could site immediate values after the instruction proper, but I've found that I do not have to do that, I can
    have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.

    Concertina III is described at
    http://www.quadibloc.com/arch/cy01int.htm

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 15:10:50 2026
    From Newsgroup: comp.arch

    On Thu, 14 May 2026 08:54:51 +0000, quadi wrote:

    With sequential decode, I suppose I could site immediate values after
    the instruction proper, but I've found that I do not have to do that, I
    can have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit
    immediates.

    I was able to correct for that. The correction came at a price, but I
    believe the price is acceptable: I can't add any more types of
    instructions that are 80 bits long or longer; but in return, not only are
    the doubleword immediates only 80 bits long, but the quadword immediates
    are only 144 bits long; neither one needs to be padded out by an extra 16 bits.

    Concertina III is described at http://www.quadibloc.com/arch/cy01int.htm

    John Savard

  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 15:40:18 2026
    From Newsgroup: comp.arch

    I added the byte immediates to the diagram, and also I found I had opcode space for the supervisor call as a 16-bit instruction... with enough additional space to also restore the opcode space to be effectively
    unbounded, since there is also enough 16-bit opcode space for a family of
    256 16-bit instruction prefixes.

    Concertina III is described at
    http://www.quadibloc.com/arch/cy01int.htm

    John Savard

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 14 21:41:59 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    After realizing that Mitch Alsup was right in that there was no real
    benefit in speeding up instruction decode in the manner I was trying to achieve with the use of block headers, I had tried, by going from banks of 32 registers to banks of 16 registers, to move to variable-length instructions.

    See Gould S.E.L 32/87 (or /67) for ideas to save a few bits here and there along the lines of base registers and register segmentation.

    For some reason, though, I couldn't make it work. It seemed like it
    should, but I couldn't get the 16-bit instructions to fit.

    Well, I've made another attempt. And it seems like going to banks of 16 registers is indeed sufficient (retaining, from Concertina II, the
    artifice of only using seven registers as base registers and another seven as index registers) to fit an instruction set as complete as the one I'm aiming for in the available opcode space.

    Of course, this does give up VLIW functionality. But while VLIW may not be
    a true failure, where it works is in small-scale embedded processors. So
    I'm not going to worry about attempting to use VLIW as a more conventional alternative to Ivan Godard's more radical Mill design.

    Is there any "real" or even "useful" advantage of VLIW ??? Given the number
    of attempts and no real long-lasting results, history should be your guide.

    With sequential decode, I suppose I could site immediate values after the instruction proper, but I've found that I do not have to do that, I can
    have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.

    In K9, we used a packet cache of 8 instructions per fetch, and used a
    scheme called "vertical neighbor" to hold non-8-bit immediates.

    In Mc88120 we just executed the SETHI and OP instructions to paste bits together.

    The experience of both led me to My 66000 that simply appends constants
    to the instructions (1 constant per 1 instruction). (The VLI decoder is
    6 gates and 2 gates of delay.) This has worked out so well that I
    encourage others to follow suit (or outright copy...)

    Concertina III is described at
    http://www.quadibloc.com/arch/cy01int.htm

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 22:11:26 2026
    From Newsgroup: comp.arch

    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    The experience of both led me to My 66000 that simply appends constants
    to the instructions (1 constant per 1 instruction). The VLI decoder is 6 gates and 2 gates of delay.) this has worked out so well, that I
    encourage others to follow suit (or outright copy...)

    That is an approach which does have an important advantage. Right now, I
    only have immediates for the basic integer and floating-point operations.
    What about decimal floating-point immediates, for example? Appending them
    to the instruction can be simple and orthogonal.

    John Savard
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu May 14 22:13:24 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    quadi <quadibloc@ca.invalid> posted:

    <snip>

    With sequential decode, I suppose I could site immediate values after the
    instruction proper, but I've found that I do not have to do that, I can
    have them within the instruction body as normally indicated by its leading
    bits - with only one bit of awkwardness for the 64-bit immediates.

    In K9, we used a packet cache of 8 instructions per fetch, and used a
    scheme called "vertical neighbor" to hold non-8-bit immediates.

    In Mc88120 we just executed the SETHI and OP instructions to paste bits
    together.

    The experience of both led me to My 66000 that simply appends constants
    to the instructions (1 constant per 1 instruction). The VLI decoder is
    6 gates and 2 gates of delay.) this has worked out so well, that I
    encourage others to follow suit (or outright copy...)

    That was how the B3500 worked. The arithmetic instructions
    were three-operand with two source operands and a destination
    operand. (the DIV instruction produced both the quotient and
    the remainder in the destination field).

    The first operand could be a small constant (4 to 24 bits - 1 to 6 BCD digits) or
    the address of the operand in memory (optionally indexed). The
    remaining two operands were memory operands.
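
    The "4 to 24 bits - 1 to 6 BCD digits" range above follows from packing
    one decimal digit per 4 bits. A minimal sketch of that arithmetic, assuming
    nothing about the real B3500 instruction layout (the function name
    pack_bcd is hypothetical and only illustrates the digit-to-bit count):

```python
def pack_bcd(value):
    """Pack a non-negative integer of 1 to 6 decimal digits into BCD.

    Returns (packed, bit_width), using 4 bits per decimal digit.
    This only illustrates the width arithmetic; it is not the B3500 format.
    """
    if value < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(value)
    if len(digits) > 6:
        raise ValueError("expected 1 to 6 decimal digits")
    packed = 0
    for d in digits:
        # Each decimal digit occupies its own 4-bit nibble.
        packed = (packed << 4) | int(d)
    return packed, 4 * len(digits)
```

    For instance, pack_bcd(7) yields a 4-bit literal and pack_bcd(123456)
    a 24-bit one, matching the two ends of the quoted range.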

  • From BGB@cr88192@gmail.com to comp.arch on Fri May 15 02:28:27 2026
    From Newsgroup: comp.arch

    On 5/14/2026 4:41 PM, MitchAlsup wrote:

    quadi <quadibloc@ca.invalid> posted:

    After realizing that Mitch Alsup was right in that there was no real
    benefit in speeding up instruction decode in the manner I was trying to
    achieve with the use of block headers, I had tried, by going from banks of
    32 registers to banks of 16 registers, to move to variable-length
    instructions.

    See Gould S.E.L 32/87 (or /67) for ideas to save a few bits here and there along the lines of base registers and register segmentation.

    For some reason, though, I couldn't make it work. It seemed like it
    should, but I couldn't get the 16-bit instructions to fit.

    Well, I've made another attempt. And it seems like going to banks of 16
    registers is indeed sufficient (retaining, from Concertina II, the
    artifice of only using seven registers as base registers and another seven
    as index registers) to fit an instruction set as complete as the one I'm
    aiming for in the available opcode space.

    Of course, this does give up VLIW functionality. But while VLIW may not be
    a true failure, where it works is in small-scale embedded processors. So
    I'm not going to worry about attempting to use VLIW as a more conventional
    alternative to Ivan Godard's more radical Mill design.

    Is there any "real" or even "useful" advantage of VLIW ??? Given the number of attempts and no real long-lasting results, history should be your guide.


    I am left to concede here as well:
    Both of my major ISA families (BJX1 and BJX2) had used VLIW;
    For XG3, I ended up abandoning it in favor of superscalar.

    The cost delta between VLIW and superscalar being not really enough to
    justify the hassles that VLIW brings to the table.


    There are still areas I have reservations:
    Coherent caches vs weak caches;
    Whether to have hardware partially take over for TLB management;
    Rather than the current system of TLB Miss and ACL Miss traps.
    ...


    Then, there are costs that I think are worth paying, but others disagree:
    Supporting Indexed Load/Store addressing;
    Supporting misaligned-safe memory access (at least for smaller types);
    Ability to have encodings with large immediate/displacement fields;
    ...

    But, more because each of these opens up a strong positive use-case while avoiding a semi-common adverse case:
    1. Making performance in Doom and similar not suck;
    2. Faster Huffman, Faster LZ, faster string functions, ...
    3. Not taking a hit pretty much every time an Imm/Disp fails to hit.
    ...


    Whereas, for cache coherence:
    Some approaches to multi-CPU multi-threading work, but they are ones
    that tend to "perform like hot garbage" even on clever chips, when they
    do work (*1).

    *1: Like, if people try to write code in a way that makes use of
    coherent caches on x86-64 systems, performance kinda tanks. And, the way
    to avoid performance tanking, is to write it like how it would work on incoherent caches.

    Well, if you can get the OS to not just schedule all of the threads on
    the same core that is; which is ironically, the same workaround one
    would use for incoherent caches.


    Goes and looks into it:
    Apparently this was because the cache hierarchy works in such a way that
    it was faster to schedule all of the threads on the same core than to
    risk dealing with the performance penalties of coherence handling
    between different cores. ... Yeah ...


    Well, I guess I can count it lucky in one way:
    At least the CPU I am running now has an integer divide that isn't
    abysmally slow. Because apparently its direct predecessors had
    implemented integer divide via microcode or something.

    Well, and MS left me alone WRT the whole Win11 thing as they consider my
    CPU to be "too old" (apparently not supporting anything much older than
    Zen2 or similar).


    Then again, recently another of the computers around that was still
    running a Phenom II has stopped working. Was working well enough, until
    it didn't work at all.

    Causes vary, sometimes it seems like capacitors are prone to release
    their goo and similar, etc...


    For a short while, there were a bunch of cheap "Dell OptiPlex" computers around (got a few of these), but annoyingly they stopped being so cheap
    (at this rate, should have probably ordered multiple of them last year
    when the price was extra low, but, alas...).

    Then again, seemingly even a lowly Core i3 in an OptiPlex could still
    hold its own against a Phenom II or an Athlon X2. Then again, even being
    cheap refurbs, they still somehow manage to have semi-new hardware (~ 2018..2020).

    Does seem sometimes like Intel CPUs have somewhat different performance characteristics (and seemingly better Perf/MHz), but I don't know all of
    the specifics as to why.


    With sequential decode, I suppose I could site immediate values after the
    instruction proper, but I've found that I do not have to do that, I can
    have them within the instruction body as normally indicated by its leading
    bits - with only one bit of awkwardness for the 64-bit immediates.

    In K9, we used a packet cache of 8 instructions per fetch, and used a
    scheme called "vertical neighbor" to hold non-8-bit immediates.

    In Mc88120 we just executed the SETHI and OP instructions to paste bits together.

    The experience of both led me to My 66000 that simply appends constants
    to the instructions (1 constant per 1 instruction). The VLI decoder is
    6 gates and 2 gates of delay.) this has worked out so well, that I
    encourage others to follow suit (or outright copy...)


    In my case, I used a prefix scheme:
    Op: Decode Imm directly;
    Prefix+Op: Route Prefix bits into Op decoder, decode Op;
    2x Prefix + Op:
    Decode as Prefix+Op case;
    The 2x Prefix case produces a second output holding (63:32).

    Had used a vaguely similar approach for the J21I/J52I prefixes I had
    glued onto RISC-V.

    Goal being that the part that decodes (63:32) shouldn't need to care
    about what is going on in the final Op, and vice-versa (mostly).


    Maybe could have been done better.
    In XG3, I started working on replacing the J_OP+Imm16 => Imm32/Imm33 encodings, mostly because it could be preferable (for saving cost) to
    maybe later allow these to be dropped.

    Similarly, I am working towards phasing out / deprecating J_IMM+J_IMM+Imm16
    for similar reasons, as it has separate decode-path logic from the
    J_IMM+J_IMM+Imm10 case, and it could be preferable to standardize on the
    latter. This applies to XG3 only, though, since XG1 and XG2 both still need
    the original encodings. The gradual direction is to allow XG3 to be
    independent of XG1 and XG2.



    Many cases could be replaced by Imm33 synthesis cases, but a few cases
    got wacky (and ended up adding stuff).

    MOVLD Rm, Imm32u, Rn // ~ PACK in RV
    MOVHD Rm, Imm32u, Rn // ~ PACKU in RV

    Becoming effectively one of 4 patterns:
    MOVLD Rm, Imm32u, Rn // { Rm[31: 0], Imm[31:0] }
    MOVHD Rm, Imm32u, Rn // { Rm[63:32], Imm[31:0] }
    MOVLD Imm32u, Rm, Rn // { Imm[31:0], Rm[31: 0] } (notation TBD)
    MOVHD Imm32u, Rm, Rn // { Imm[31:0], Rm[63:32] }

    This can replace both the "MOVHI32 Imm32, Rn" and "SHORI32 Imm32, Rn" encodings, and a few other cases.
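
    The four concatenation patterns listed above can be sketched on 64-bit
    values as follows. This is a minimal sketch, assuming the brace notation
    is Verilog-style concatenation (leftmost field in the high 32 bits); the
    Python function names are hypothetical and are not part of any XG3
    toolchain.

```python
MASK32 = 0xFFFFFFFF

def lo32(x):
    """Bits [31:0] of a 64-bit value."""
    return x & MASK32

def hi32(x):
    """Bits [63:32] of a 64-bit value."""
    return (x >> 32) & MASK32

def concat(high, low):
    """{ high[31:0], low[31:0] } as a 64-bit result (assumed Verilog-style)."""
    return (lo32(high) << 32) | lo32(low)

def movld_rm_imm(rm, imm):   # MOVLD Rm, Imm32u, Rn -> { Rm[31:0],  Imm[31:0] }
    return concat(lo32(rm), imm)

def movhd_rm_imm(rm, imm):   # MOVHD Rm, Imm32u, Rn -> { Rm[63:32], Imm[31:0] }
    return concat(hi32(rm), imm)

def movld_imm_rm(imm, rm):   # MOVLD Imm32u, Rm, Rn -> { Imm[31:0], Rm[31:0] }
    return concat(imm, lo32(rm))

def movhd_imm_rm(imm, rm):   # MOVHD Imm32u, Rm, Rn -> { Imm[31:0], Rm[63:32] }
    return concat(imm, hi32(rm))
```

    Under that reading, movhd_rm_imm keeps the high half of Rm and replaces
    the low half, while movld_imm_rm keeps the low half of Rm and loads a
    new high half.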

    These patterns didn't strictly emerge naturally from "stick a jumbo
    prefix onto the existing instruction" case, as the decoder needs to
    special case how the instructions are decoded if there is a prefix.

    These expanding on the original MOVLD/MOVHD instructions.

    This particular trick wouldn't work on RV+Jx though...


    Things might be different though had all of this been done "entirely
    clean" vs incremental mutation though.


    Concertina III is described at
    http://www.quadibloc.com/arch/cy01int.htm

    John Savard


  • From quadi@quadibloc@ca.invalid to comp.arch on Fri May 15 15:58:03 2026
    From Newsgroup: comp.arch

    On Thu, 14 May 2026 22:11:26 +0000, quadi wrote:

    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    The experience of both led me to My 66000 that simply appends constants
    to the instructions (1 constant per 1 instruction). The VLI decoder is
    6 gates and 2 gates of delay.) this has worked out so well, that I
    encourage others to follow suit (or outright copy...)

    That is an approach which does have an important advantage. Right now, I
    only have immediates for the basic integer and floating-point
    operations.
    What about decimal floating-point immediates, for example? Appending
    them to the instruction can be simple and orthogonal.

    However, while it is simple and orthogonal, I didn't like not having a
    unified scheme of decoding the lengths of instructions.
    I was able to re-organize the opcode space for instructions longer than 32 bits so as to be able to have both minimal-length immediate instructions
    for the basic operations and data types, and additional immediate
    instructions which are 16 bits longer for additional operands and data
    types, so I went that way.
    Another issue is that in Concertina II, while an additional bit indicated
    a pseudo-immediate, the register field did not go to waste; it was used to indicate the position of the pseudo-immediate in the block. So appending immediates would have been a temptation to decree that register 15 or
    register 0 couldn't be the source for a register operand, which would be
    bad.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat May 16 03:57:06 2026
    From Newsgroup: comp.arch

    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
    that a VLIW processor is highly successful as an embedded video processor.
    Of course, though, he is hardly a disinterested source.

    But the idea that putting bits in instructions to indicate that they can
    be executed in parallel can enhance pipelining without the huge overhead
    of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've
    noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.

    In a way, Concertina II is VLIW "perfected" - by putting the bits that indicate parallelism in a header at the start of the block, the price of indicating parallelism isn't a shorter instruction word, and hence having
    to make do with fewer registers, or shorter displacement fields, all
    things that do have an obvious negative impact on performance.

    And by going from the block-oriented Concertina II design to the
    variable-length instruction Concertina III design, I've gone from banks of 32
    registers to banks of 16 registers!

    Did I have to do this?

    In Concertina III, instructions longer than 32 bits take up 1/16 of the
    opcode space. Adding a bit so as to use 32 registers instead of 16 would change that to 1/8.

    In Concertina II, the 32-bit instructions take up about 3/4 of the opcode space.

    So an ISA without block structure, with variable-length instructions
    instead, with banks of 32 registers is possible! However, only 1/8 of the opcode space would be left for short instructions, and 16-bit instructions with only 13 bits available... would be largely useless. If having the
    option of using 16-bit instructions is the primary benefit of having variable-length instructions, instead of every instruction being 32 bits long... then attempting to obtain the best of Concertina II and III in a single design through this artifice... which seems so very tempting... is
    a mistake.
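
    The opcode-space arithmetic in the paragraphs above can be checked with a
    few lines. This is a sketch using only the fractions quoted in the post
    (1/8 for instructions longer than 32 bits once register banks grow to 32,
    and about 3/4 for 32-bit instructions); the helper name usable_bits is
    hypothetical, and it assumes each share is a power-of-two fraction of the
    space.

```python
from fractions import Fraction

long_share = Fraction(1, 8)     # instructions longer than 32 bits, 32-register banks
word32_share = Fraction(3, 4)   # 32-bit instructions, as in Concertina II
short_share = 1 - long_share - word32_share   # what is left for 16-bit instructions

def usable_bits(word_bits, share):
    """Free bits in a word_bits-wide instruction that owns `share` of the opcode space.

    Assumes the share corresponds to a power-of-two number of encodings.
    """
    encodings = share * 2 ** word_bits
    if encodings.denominator != 1:
        raise ValueError("share does not give a whole number of encodings")
    return int(encodings).bit_length() - 1

print(short_share, usable_bits(16, short_share))   # prints: 1/8 13
```

    The same helper gives 14 bits for a 1/4 share, which is the System/360
    case mentioned just below.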

    Of course, the 360 managed to get by quite well with only 1/4 of the opcode space used by 16-bit instructions. Could 14 bits be useful where 13 bits
    are doomed to fail, and if so, what contrivance could I possibly use to squeeze out that extra opcode space... since I've tried, and abandoned as fatally flawed, a _lot_ of contrivances to squeeze out space in just that
    way in the development of Concertina II?

    Block structure had the advantage of letting me pack more bits in instructions. That it let me offer VLIW, in the sense of controlling
    parallel execution, as an option... was just gravy.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat May 16 04:13:01 2026
    From Newsgroup: comp.arch

    On Sat, 16 May 2026 03:57:06 +0000, quadi wrote:

    Could 14 bits be useful where
    13 bits are doomed to fail,

    Actually, though, I had worked out two ways where 16 bit short
    instructions that all must start with 111 could perhaps do useful work.

    The first one was:

    111 + (seven bit opcode) + (3) + (3)

    Just have operate instructions that only use the first eight registers.

    And then I came up with an alternative:

    111 + (seven bit opcode) + (1) + (5)
    11111 + (seven bit opcode) + (3) + (1)

    since I'm only using 96 opcodes, not 128.

    In the primary format, only registers 0 and 1 are destination registers,
    but all 32 registers are source registers.

    The secondary format tries to balance that out by letting results in those
    two accumulators participate in operations with the first eight registers
    as destination registers. So those first two registers are still a
    bottleneck, but the need to add extra operations to move results out of
    those registers is, hopefully, reduced.
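
    The two 16-bit formats of the alternative scheme can be sketched as a
    small encoder and decoder. The field widths come from the post; mapping
    the (1) and (5) fields to destination and source, and the (3) and (1)
    fields to destination and source, follows the description above but the
    exact bit order is an assumption, and all function names are
    hypothetical.

```python
def encode_primary(opcode, dest, src):
    """111 + opcode(7) + dest(1) + src(5); only opcodes 0..95 are usable."""
    if not (0 <= opcode < 96 and dest in (0, 1) and 0 <= src < 32):
        raise ValueError("field out of range for the primary format")
    return (0b111 << 13) | (opcode << 6) | (dest << 5) | src

def encode_secondary(opcode, dest, src):
    """11111 + opcode(7) + dest(3) + src(1)."""
    if not (0 <= opcode < 128 and 0 <= dest < 8 and src in (0, 1)):
        raise ValueError("field out of range for the secondary format")
    return (0b11111 << 11) | (opcode << 4) | (dest << 1) | src

def decode_short(word):
    """Return the fields of a 16-bit short instruction, or None if it is not one."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if word >> 11 == 0b11111:
        # Primary opcodes 96..127 would begin with 11, so this prefix is free.
        return {"format": "secondary", "opcode": (word >> 4) & 0x7F,
                "dest": (word >> 1) & 0x7, "src": word & 0x1}
    if word >> 13 == 0b111:
        return {"format": "primary", "opcode": (word >> 6) & 0x7F,
                "dest": (word >> 5) & 0x1, "src": word & 0x1F}
    return None
```

    Restricting the primary format to 96 opcodes is what frees the 11111
    prefix: a primary opcode of 96 or more would start with the bits 11 and
    collide with it.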

    But as far as I know, nobody has tried to design an ISA this way, so
    nobody has tried to figure out how to write a compiler to make effective
    use of such a design.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat May 16 13:13:59 2026
    From Newsgroup: comp.arch

    On Sat, 16 May 2026 04:13:01 +0000, quadi wrote:
    On Sat, 16 May 2026 03:57:06 +0000, quadi wrote:

    Could 14 bits be useful where 13 bits are doomed to fail,

    Actually, though, I had worked out two ways where 16 bit short
    instructions that all must start with 111 could perhaps do useful work.

    The first one was:

    111 + (seven bit opcode) + (3) + (3)

    I have finally realized that there is a way to turn the impossible goal
    that seemed so tantalizingly close to achievement into something possible.

    Just add

    11111 +
    (break bit) +
    (seven-bit opcode) +
    (condition code bit) +
    (five-bit destination register) +
    (five-bit source register)

    and there you have it. An operate instruction that has five-bit source and destination register fields, and is shorter than 32 bits.
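
    The field list above adds up to 24 bits (5 + 1 + 7 + 1 + 5 + 5). A minimal
    sketch of packing and unpacking it, assuming the fields are laid out
    most-significant first in the order listed; the names FIELDS, encode24
    and decode24 are hypothetical.

```python
# (name, width) pairs, most significant field first, as listed in the post.
FIELDS = (("prefix", 5), ("break_bit", 1), ("opcode", 7),
          ("cc", 1), ("dest", 5), ("src", 5))

def encode24(break_bit, opcode, cc, dest, src):
    """Pack the fields into one 24-bit integer with the fixed 11111 prefix."""
    values = {"prefix": 0b11111, "break_bit": break_bit, "opcode": opcode,
              "cc": cc, "dest": dest, "src": src}
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def decode24(word):
    """Unpack a 24-bit integer into its fields, checking the prefix."""
    fields = {}
    shift = sum(width for _, width in FIELDS)   # 24
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    if fields["prefix"] != 0b11111:
        raise ValueError("not a 24-bit short instruction")
    return fields
```

    A word built this way occupies three bytes, which is why the next
    instruction can then start on any byte boundary rather than only on
    16-bit boundaries.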

    What's that? It isn't sixteen bits long! No, it isn't. But if each
    instruction indicates how long it is with its starting bits, and then one looks for the next instruction where that one ends... then instructions
    can start anywhere.

    Well, at least the last bit in the displacement field of a jump
    instruction is no longer going to waste.

    I wanted to follow the illustrious example of the 68000 and the
    System/360, instead of the disaster that is x86, but if 24-bit short instructions are the price of having register banks with 32 registers -
    *and* continuing to have the option of 12-bit displacements and 20-bit displacements - then it has to be paid.

    Yes. Concertina IV is coming. Be very afraid?

    John Savard
  • From John Levine@johnl@taugh.com to comp.arch on Sat May 16 17:35:04 2026
    From Newsgroup: comp.arch

    According to quadi <quadibloc@ca.invalid>:
    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
    that a VLIW processor is highly successful as an embedded video processor.
    Of course, though, he is hardly a disinterested source.

    It works great in programs where the compiler can predict the sequence of memory
    references at compile time, much less well when the sequence is data dependent.

    I can believe that video processing falls into the first category.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From BGB@cr88192@gmail.com to comp.arch on Sat May 16 12:38:58 2026
    From Newsgroup: comp.arch

    On 5/15/2026 10:57 PM, quadi wrote:
    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
    that a VLIW processor is highly successful as an embedded video processor.
    Of course, though, he is hardly a disinterested source.

    But the idea that putting bits in instructions to indicate that they can
    be executed in parallel can enhance pipelining without the huge overhead
    of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've
    noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.


    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
    VLIW:
    + Slightly cheaper logic;
    - Binary depends more on processor specifics.
    In-Order:
    - Needs logic to handle register deps and lookup opcode flags.
    + Code does not depend on uArch.

    In-Order vs Out-of-Order:
    In-Order:
    + Simpler hardware
    - Not as fast
    OoO:
    - Complex hardware (reorder buffer, scoreboard/renamer, ...)
    + Faster


    Both VLIW and In-Order benefit from a large register file.
    OoO mostly benefits ISA designs that would otherwise be slow.
    Mostly absorbing the cost of a lot of the ISA level inefficiencies.


    Theoretically, OoO can better absorb cache misses; however, my own
    testing implies that the delta between "cache miss results in pipeline stall"
    and "delay instruction to hide miss" appears to be mostly negligible.


    Also raw CPU speed doesn't matter as much when the computation is
    primarily limited by RAM bandwidth or latency (seems to be a pretty
    common scenario IME).


    In my case, I had realized that In-Order could be handled nearly exactly
    the same as my prior LIW handling (no real changes needed to the
    pipeline, etc), with the primary change that the I$ can have logic to
    detect which instructions can run in parallel during cache line fetch,
    and doing this is in-effect cheap enough to be worthwhile (the in-order
    not adding any significant resource cost over LIW).

    So, in my case, 16 byte cache lines, in Op0..Op3:
    Can Op0 co-execute with Op1?
    Can Op1 co-execute with Op2?
    Can Op2 co-execute with Op3?
    Can Op0/1/2 co-execute?
    Can Op1/2/3 co-execute?

    Not too unreasonable.
    Implementation currently can't deal with 16-bit ops, checks across cache lines, misaligned ops, mixed RV and XG3 sequences, ... But, still mostly
    works reasonably OK. An implementation that dealt with all of these edge
    cases (such as to not take a significant performance hit with RV-C)
    would have added a little more cost though.

    The co-mixed RV and XG3 scenario was mostly limited because checking for register aliases between the mismatched register fields (reg5/reg6) was
    more expensive than ideal. So, cheaper to only check between ops of the
    same type and assume that mismatched ops may potentially have a
    register-alias (even when they don't).


    So, say, superscalar logic was a lookup over opcode bits for flags like:
    Can this op run in Lane 2?
    Can this op run in Lane 3?
    Can this op run with another op in Lane 2?
    Can this op run with another op in Lane 3?
    Does this op use Rd as a source?
    Does this op use Rt as a source?
    ...
    Then, say, checks between register fields:
    Rd0==Rs1, Rd0==Rt1
    Rd1==Rs0, Rd1==Rt0
    ...

    Then, feed all of these bits through a few lookups, reducing it to the
    "Can Op0/Op1 co-execute?" question.

    Not free, but fast enough to be handled when a new cache line arrives
    (cache line generally being mode-tagged, etc).
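
    The pairwise check described above (per-opcode flags plus register-field
    comparisons, reduced to one "can these co-execute?" bit) might look
    roughly like this in software. It is a sketch under assumptions: the Op
    record, its flag names, and the exact meaning given to each flag are
    hypothetical, and real hardware would do this with lookups and
    comparators rather than sets.

```python
from dataclasses import dataclass

@dataclass
class Op:
    rd: int                       # destination register field
    rs: int                       # first source register field
    rt: int                       # second source register field
    lane2_ok: bool = True         # assumed: "Can this op run in Lane 2?"
    pair_ok: bool = True          # assumed: "Can this op run with another op?"
    rd_is_source: bool = False    # "Does this op use Rd as a source?"
    rt_is_source: bool = True     # "Does this op use Rt as a source?"

def sources(op):
    """Registers the op actually reads, according to its flags."""
    regs = {op.rs}
    if op.rt_is_source:
        regs.add(op.rt)
    if op.rd_is_source:
        regs.add(op.rd)
    return regs

def can_coexecute(op0, op1):
    """True if op0 (first lane) and op1 (second lane) may issue together."""
    if not (op1.lane2_ok and op0.pair_ok and op1.pair_ok):
        return False
    if op0.rd in sources(op1):    # Rd0==Rs1, Rd0==Rt1
        return False
    if op1.rd in sources(op0):    # Rd1==Rs0, Rd1==Rt0
        return False
    return True
```

    Because the answer depends only on the instruction bits, it can be
    computed once when the cache line arrives and stored alongside it, which
    matches the point that the check only has to be fast enough for line
    fill.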

    ...


    In a way, Concertina II is VLIW "perfected" - by putting the bits that indicate parallelism in a header at the start of the block, the price of indicating parallelism isn't a shorter instruction word, and hence having
    to make do with fewer registers, or shorter displacement fields, all
    things that do have an obvious negative impact on performance.

    And by going from the block-oriented Concertina II design to the
    variable-length instruction Concertina III design, I've gone from banks of 32
    registers to banks of 16 registers!

    Did I have to do this?

    In Concertina III, instructions longer than 32 bits take up 1/16 of the opcode space. Adding a bit so as to use 32 registers instead of 16 would change that to 1/8.

    In Concertina II, the 32-bit instructions take up about 3/4 of the opcode space.

    So an ISA without block structure, with variable-length instructions
    instead, with banks of 32 registers is possible! However, only 1/8 of the opcode space would be left for short instructions, and 16-bit instructions with only 13 bits available... would be largely useless. If having the
    option of using 16-bit instructions is the primary benefit of having variable-length instructions, instead of every instruction being 32 bits long... then attempting to obtain the best of Concertina II and III in a single design through this artifice... which seems so very tempting... is
    a mistake.

    Of course, the 360 managed to get by quite well with only 1/4 of the opcode space used by 16-bit instructions. Could 14 bits be useful where 13 bits
    are doomed to fail, and if so, what contrivance could I possibly use to squeeze out that extra opcode space... since I've tried, and abandoned as fatally flawed, a _lot_ of contrivances to squeeze out space in just that
    way in the development of Concertina II?

    Block structure had the advantage of letting me pack more bits in instructions. That it let me offer VLIW, in the sense of controlling
    parallel execution, as an option... was just gravy.


    Bits, elsewhere, are still bits.
    And, things like the pigeonhole principle and similar still apply.

    Now you have mostly just added the issue that either there is a spot
    that can't be used and needs to be skipped over (per-block), the number
    of instructions per block is NPOT (non-power-of-two) and/or the
    instruction size is NPOT.

    One could try making the memory blocks NPOT, but this itself adds suck.

    John Savard

  • From BGB@cr88192@gmail.com to comp.arch on Sat May 16 12:58:28 2026
    From Newsgroup: comp.arch

    On 5/16/2026 12:35 PM, John Levine wrote:
    According to quadi <quadibloc@ca.invalid>:
    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
    that a VLIW processor is highly successful as an embedded video processor.
    Of course, though, he is hardly a disinterested source.

    It works great in programs where the compiler can predict the sequence of memory
    references at compile time, much less well when the sequence is data dependent.

    I can believe that video processing falls into the first category.


    I suspect personally (based on behaviors I have seen in existing consumer-grade processors):
    It can either be predicted in advance;
    Or, it can't reliably be predicted at all.

    Seemingly, if comparing modern fancy CPUs with designs that have a
    competently designed ISA (but are In-Order):
    I haven't usually seen all that strong of a divergence between in-order
    and out-of-order results in various benchmarks (when excluding those
    that are determined primarily by "How fast does the RAM go?").

    Like, seemingly everything mostly scales fairly linearly with
    clock-speed, and with OoO seemingly only gaining a fairly minor bump.

    Well, at least if excluding things like:
    Short/tight loop with stupidly complex arithmetic expression.
    OoO does pretty well at these...


    But, then, this is a coding style that is better off not used, because
    it often performs poorly in-general. And, seemingly, when writing code
    in ways that perform well in general, much of the advantages seemingly evaporate (well, and/or it becomes RAM speed bound, whichever happens
    first).

    Like, big fancy/expensive tool that mostly compensates for some
    combination of poor ISA and poorly optimized code.


    ...

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 16 18:23:32 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/15/2026 10:57 PM, quadi wrote:
    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video, that a VLIW processor is highly successful as an embedded video processor. Of course, though, he is hardly a disinterested source.

    But the idea that putting bits in instructions to indicate that they can
    be executed in parallel can enhance pipelining without the huge overhead
    of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.


    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
    VLIW:
    + Slightly cheaper logic;
    - Binary depends more on processor specifics.
    In-Order:
    - Needs logic to handle register deps and lookup opcode flags.
    + Code does not depend on uArch.

    In-Order vs Out-of-Order:
    In-Order:
    + Simpler hardware
    - Not as fast
    OoO:
    - Complex hardware (reorder buffer, scoreboard/renamer, ...)
    + Faster

    S/Faster/Higher Performing/

    Both VLIW and In-Order benefit from a large register file.
    OoO mostly benefits ISA designs that would otherwise be slow.
    Mostly absorbing the cost of a lot of the ISA level inefficiencies.


    Theoretically, OoO can better absorb cache misses, however my own
    testing implies that the delta vs "cache miss results in pipeline stall"
    vs "delay instruction to hide miss" appears to be mostly negligible.

    You do not have an execution pipeline with depth > L1 cache miss
    latency. When you do, new effects become feasible--like beginning
    the second next loop iteration before the first one has completed.
    This is where you can now absorb the L1 cache miss latency.


    Also raw CPU speed doesn't matter as much when the computation is
    primarily limited by RAM bandwidth or latency (seems to be a pretty
    common scenario IME).


    In my case, I had realized that In-Order could be handled nearly exactly
    the same as my prior LIW handling (no real changes needed to the
    pipeline, etc), with the primary change that the I$ can have logic to
    detect which instructions can run in parallel during cache line fetch,

    When you do not have condition codes, and only 1 register file, you
    can determine parallel-ness by simply looking at the registers.

    and doing this is in-effect cheap enough to be worthwhile (the in-order
    not adding any significant resource cost over LIW).

    So, in my case, 16 byte cache lines, in Op0..Op3:
    Can Op0 co-execute with Op1?
    When Rd-1 ~= either{SRC1-2, or SRC2-2}
    Can Op1 co-execute with Op2?
    When Rd-1 ~= either{SRC1-3, or SRC2-3}
    Can Op2 co-execute with Op3?
    When Rd-2 ~= either{SRC1-3, or SRC2-3}
    Can Op0/1/2 co-execute?
    Can Op1/2/3 co-execute?
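
    The register comparison described above can be sketched roughly as follows. This is only an illustration, not either poster's actual logic: the Op tuple and function names are hypothetical, only read-after-write hazards are checked, and lane capability and write-after-write conflicts are ignored.

```python
from typing import NamedTuple, Optional, Sequence

class Op(NamedTuple):
    rd: Optional[int]    # destination register, None if nothing is written
    src1: Optional[int]  # first source register, None if unused
    src2: Optional[int]  # second source register, None if unused

def can_pair(first: Op, second: Op) -> bool:
    """True when `second` does not read the register `first` writes."""
    if first.rd is None:
        return True
    return first.rd not in (second.src1, second.src2)

def can_issue_together(ops: Sequence[Op]) -> bool:
    """Every later op must be independent of every earlier op's result."""
    return all(
        can_pair(ops[i], ops[j])
        for i in range(len(ops))
        for j in range(i + 1, len(ops))
    )

op0 = Op(rd=3, src1=1, src2=2)
op1 = Op(rd=4, src1=3, src2=5)   # reads r3, which op0 writes
op2 = Op(rd=6, src1=1, src2=2)

print(can_pair(op0, op1))                    # False
print(can_pair(op0, op2))                    # True
print(can_issue_together([op0, op2, op1]))   # False
```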
    ----------
    So, say, superscalar logic was a lookup over opcode bits for flags like:
    can this op run in Lane 2?
    Depends on what is in Lane 2
    Can this op run in Lane 3?
    ...
    Can this op run with another op in Lane 2?
    Can this op run with another op in Lane 3?
    Does this op use Rd as a source?
    Does this op use Rt as a source?

    Given nomenclature like Mc88120 where {
    Lanes = {MEM0, MEM1, MEM2, FADD, FMUL, Branch}
    And MEM has an integer unit, and a shift unit
    FADD has an integer unit
    FMUL has an integer unit
    Branch has an integer unit }
    And each unit is buffered with its own reservation station;
    You just let the RSs create a solution.

    Given nomenclature like M5 with >10 FUs, the calculation is harder,
    but you still just let the RSs create the solution.
    --------------
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat May 16 20:48:45 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to quadi <quadibloc@ca.invalid>:
    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
    that a VLIW processor is highly successful as an embedded video processor.
    Of course, though, he is hardly a disinterested source.

    It works great in programs where the compiler can predict the sequence of memory
    references at compile time, much less well when the sequence is data dependent.

    I can believe that video processing falls into the first category.

    Feel free to believe so, then read up on the CABAC h.264 encoding:

    This is an arithmetic compression setup where you for every bit decoded
    have to make a branch to separate code using a different context (that
    is the context-adaptive binary arithmetic coding which gave the acronym).

    What it means is that any sw decoder will have a 50% branch where you
    cannot "simply" execute both parts in parallel, or use the same code
    just with context-dependent table lookups.

    It is fine for HW, pretty much pessimal for SW.

    It works due to two factors: (a) Most/many videos use the much more SW-friendly alternative encoding which provides a few percent less
    compression but at comparatively lower encode/decode cost, and (b) CPU vendors like Intel license a chunk of VLSI intellectual property which
    does major parts (or all?) in hardware, mostly because it also saves a
    ton of power, allowing a cell phone or laptop to play video without
    running out of battery power long before the film ends.


    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 17 15:24:20 2026
    From Newsgroup: comp.arch

    On Sat, 16 May 2026 13:13:59 +0000, quadi wrote:
    On Sat, 16 May 2026 04:13:01 +0000, quadi wrote:

    The first one was:

    111 + (seven bit opcode) + (3) + (3)

    I have finally realized that there is a way to turn the impossible goal
    that seemed so tantalizingly close to achievement into something
    possible.

    Just add

    11111 +
    (break bit) +
    (seven-bit opcode) +
    (condition code bit) +
    (five-bit destination register) +
    (five-bit source register)

    The thing is, though, that in Concertina IV, I want to bring back some
    things that Concertina III, with banks of 16 registers, had to give up.
    20-bit long displacements, for one, and extended register banks of 128
    registers for another.

    So I need additional opcode space for 48-bit instructions.

    I have come up with a place to find it.

    Just drop the 16-bit short instructions entirely, as, being confined to
    the first eight registers, they're not very useful (? actually, they could
    be quite useful, unless code that isn't spread out in a large register
    bank hits performance badly, which is likely to be the case in a typical Concertina IV implementation, given that its design is closer to that of Concertina II than III) keeping only the 24-bit short instructions. That almost doubles the opcode space left for instructions larger than 32 bits.

    John Savard
  • From BGB@cr88192@gmail.com to comp.arch on Sun May 17 15:37:47 2026
    From Newsgroup: comp.arch

    On 5/16/2026 1:23 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/15/2026 10:57 PM, quadi wrote:
    On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

    Is there any "real" or even "useful" advantage of VLIW ??? Given the
    number of attempts and no real long-lasting results, history should be
    your guide.

    As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
    that a VLIW processor is highly successful as an embedded video processor.
    Of course, though, he is hardly a disinterested source.

    But the idea that putting bits in instructions to indicate that they can
    be executed in parallel can enhance pipelining without the huge overhead
    of out-of-order execution seems plausible to me. It's the same sort of
    argument that Ivan Godard made for his innovative Mill design. You've
    noted, though, that unlike register hazards, cache misses, which are
    unpredictable by compilers, can be handled by a simpler form of OoO, the
    scoreboard of the 6600.


    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
    VLIW:
    + Slightly cheaper logic;
    - Binary depends more on processor specifics.
    In-Order:
    - Needs logic to handle register deps and lookup opcode flags.
    + Code does not depend on uArch.

    In-Order vs Out-of-Order:
    In-Order:
    + Simpler hardware
    - Not as fast
    OoO:
    - Complex hardware (reorder buffer, scoreboard/renamer, ...)
    + Faster

    S/Faster/Higher Performing/

    Both VLIW and In-Order benefit from a large register file.
    OoO mostly benefits ISA designs that would otherwise be slow.
    Mostly absorbing the cost of a lot of the ISA level inefficiencies.


    Theoretically, OoO can better absorb cache misses, however my own
    testing implies that the delta vs "cache miss results in pipeline stall"
    vs "delay instruction to hide miss" appears to be mostly negligible.

    You do not have an execution pipeline with depth > L1 cache miss
    latency. When you do, new effects become feasible--like beginning
    the second next loop iteration before the first one has completed.
    This is where you can now absorb the L1 cache miss latency.


    OK. I am not sure what mainstream CPUs are doing here.


    Mainly I had been running AMD chips, but had often failed to see much
    that puts them clearly outside the realm of what could be expected from extrapolating the clock speeds and adding some minor fudge-factors
    relative to modeling an in-order design (*1).

    There is a difference for Intel CPUs though, which generally appear to
    give better perf per clock in various cases, and have performance
    behaviors which don't really match up to modeling them as an in-order
    machine.


    *1: Though not necessarily one running x86...

    One needs to assume significantly more registers and that store+reload
    to the same address behaves like a register MOV and similar. But, it is
    like, if one replaces x86 with a RISC-style ISA, the overall performance behavior can match up pretty well.

    Otherwise, one may also need to assume the ability to co-issue memory
    loads.


    Say, for example, if one were to pretend that a Piledriver sort of
    looked like:
    64 registers
    3 ALUs
    1 LD/ST per clock (with a 4 cycle load latency)
    24 cycle branch mispredict
    1 SIMD op per clock
    ...

    And, Zen+ sorta like, say:
    4 ALUs
    2 LD / 1 ST per clock (4c load)
    12 cycle branch mispredict
    1 SIMD op per clock
    ...


    But, as noted, this approach seems to fall apart with Intel CPUs, which
    seem to diverge more noticeably from predictions one could make based on assuming an in-order model (and not as easily modeled in general).


    If one assumes a 32K direct-mapped L1 + 4 way victim cache and 32-byte
    cache lines, this also seems to match up reasonably well. Like, it isn't
    quite as smooth as a 4-way associative cache, nor as poorly behaved as a
    plain direct-mapped L1.

    But, the DM L1 + VC approach was based on my own design efforts, but
    does still appear curiously close to benchmarks run on PC class
    hardware. In my case, I went with a 4-way VC mostly for cost
    optimization (gains from going 8-way were small, but cost was steep).

    But, it is strange in a way, as AFAIK the x86 chips have native set-associative caches.


    Do need to assume a high associativity for the L2 caches though (unlike
    my designs which had used a direct-mapped L2 and the VC was more to
    reduce "damage" caused to the L2 cache by the L1 conflict misses, which
    were comparably more expensive in L2 land).


    One difference being that for the PC, one needs to assume that
    load/store ordering is preserved between cores, but to be modeled one
    can add around an 80 cycle or so penalty for every time a line is
    modified on one core and then read on another.

    ...



    Also raw CPU speed doesn't matter as much when the computation is
    primarily limited by RAM bandwidth or latency (seems to be a pretty
    common scenario IME).


    In my case, I had realized that In-Order could be handled nearly exactly
    the same as my prior LIW handling (no real changes needed to the
    pipeline, etc), with the primary change that the I$ can have logic to
    detect which instructions can run in parallel during cache line fetch,

    When you do not have condition codes, and only 1 register file, you
    can determine parallel-ness by simply looking at the registers.


    Yes.

    There is a little more though depending on "how" the instructions may be
    used in a core where not all lanes are the same.


    and doing this is in-effect cheap enough to be worthwhile (the in-order
    not adding any significant resource cost over LIW).

    So, in my case, 16 byte cache lines, in Op0..Op3:
    Can Op0 co-execute with Op1?
    When Rd-1 ~= either{SRC1-2, or SRC2-2}
    Can Op1 co-execute with Op2?
    When Rd-1 ~= either{SRC1-3, or SRC2-3}
    Can Op2 co-execute with Op3?
    When Rd-2 ~= either{SRC1-3, or SRC2-3}
    Can Op0/1/2 co-execute?
    Can Op1/2/3 co-execute?
    ----------
    So, say, superscalar logic was a lookup over opcode bits for flags like:
    can this op run in Lane 2?
    Depends on what is in Lane 2


    My case:
    Lane 1:
    MOV, ALU, CONV1/2, SHAD, CMP, LEA/MEM, MUL,
    BRANCH, FPU/SIMD (*), ...
    Lane 2:
    MOV, ALU, CONV1/2, SHAD, LEA, FPU/SIMD (*)
    *: But, Lane 1/2 can't co-issue FPU or SIMD unless "compatible".
    Lane 3:
    MOV, ALU, CONV1

    Where:
    MOV: 1-cycle register MOV (includes constant load);
    ALU: Basic ALU instructions
    CONV1: Basic Converter ops (sign/zero extension, etc)
    CONV2: Advanced converter ops (SIMD, etc)
    SHAD: Integer Shift
    CMP: Integer Compare
    LEA: Address computation
    MEM: Memory Load/Store
    MUL: Integer Multiply
    BRANCH: Obvious enough
    FPU/SIMD: Obvious enough

    Lane 3 originally had SHAD as well, but it was dropped mostly because
    it was rarely used, so the cost was harder to justify.


    So, based on which lanes have the needed units, it determines which
    flags for the lanes it is allowed to run in.
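
    As a rough sketch of that lookup, the lane lists above can be written as a table and queried per unit type. The dictionary layout and the function name allowed_lanes are illustrative assumptions, FPU and SIMD are listed as separate entries for simplicity, and the "compatible" restriction on co-issuing FPU/SIMD in Lanes 1 and 2 is not modeled.

```python
# Which functional units each lane can reach, taken from the lists above.
LANE_UNITS = {
    1: {"MOV", "ALU", "CONV1", "CONV2", "SHAD", "CMP", "LEA", "MEM",
        "MUL", "BRANCH", "FPU", "SIMD"},
    2: {"MOV", "ALU", "CONV1", "CONV2", "SHAD", "LEA", "FPU", "SIMD"},
    3: {"MOV", "ALU", "CONV1"},
}

def allowed_lanes(unit: str) -> list:
    """Lanes whose hardware includes the unit this op needs."""
    return [lane for lane in sorted(LANE_UNITS) if unit in LANE_UNITS[lane]]

print(allowed_lanes("ALU"))    # [1, 2, 3]
print(allowed_lanes("SHAD"))   # [1, 2]
print(allowed_lanes("MEM"))    # [1]
```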


    There are some gains from having the compiler try to shuffle
    instructions around and look for an ordering that fits well into the
    pipeline.


    Though, a more advanced IF stage would maybe have a mechanism to allow swapping instructions if they would be able to co-execute but would need
    to do so in a different lane ordering.

    This could maybe overlap with a TODO item of making fetches
    align/justify the instructions with the correct lanes rather than have
    the ID stage deal with this part (say, for example, so that Lane3 always
    uses the same decoder and could allow for more corner-cutting, and less register-port routing logic).


    Can this op run in Lane 3?
    ...
    Can this op run with another op in Lane 2?
    Can this op run with another op in Lane 3?
    Does this op use Rd as a source?
    Does this op use Rt as a source?

    Given nomenclature like Mc88120 where {
    Lanes = {MEM0, MEM1, MEM2, FADD, FMUL, Branch}
    And MEM has an integer unit, and a shift unit
    FADD has an integer unit
    FMUL has an integer unit
    Branch has an integer unit }
    And each unit is buffered with its own reservation station;
    You just let the RSs create a solution.

    Given nomenclature like M5 with >10 FUs, the calculation is harder,
    but you still just let the RSs create the solution.
    --------------


    I was using a convention where the pipeline is divided into 3 lanes.
    So, everything plugs into a single unified pipeline, rather than a
    separate sub-pipeline for each FU.


    Each has fixed register ports and other resources:
    Lane 1: Rs, Rt, Imm1 (33b)
    Lane 2: Ru, Rv, Imm2 (33b)
    Lane 3: Rx, Ry, Imm3 (17b/33b)

    Each lane has a register write port, along with some flag bits for
    whether the result is ready (for sake of register forwarding), ...


    Can put an ALU op into each, or whichever instruction into a given lane
    that the lane in question has access to a unit capable of handling it.

    Say, for example:
    If you tried putting a Shift or Multiply or SIMD op in Lane 3, it
    wouldn't work, because Lane 3 lacks the logic to handle it.

    As noted, each lane normally only has 2 register read ports and 1
    immediate. If an instruction needs 3 inputs, or a 2nd immediate (17b
    only for now), it eats Lane 3 (which can no longer hold an instruction).

    Or, if a SIMD op happens, Lane 1 eats both 2 and 3, turning them
    effectively into a single wider lane for that instruction.
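
    The port-borrowing rules above might be modeled along these lines. This is a sketch under assumptions: the function name and parameters are hypothetical, SIMD ops are assumed to start in Lane 1 (as the description suggests), and an op that borrows ports is assumed to sit in Lane 1 or 2.

```python
def lanes_consumed(home_lane: int, is_simd: bool = False,
                   reg_inputs: int = 2, immediates: int = 1) -> set:
    """Set of lanes an op occupies, given 2 read ports and 1 immediate per lane."""
    if is_simd:
        if home_lane != 1:
            raise ValueError("assumed: SIMD ops issue from Lane 1")
        return {1, 2, 3}               # Lane 1 eats Lanes 2 and 3
    if reg_inputs > 2 or immediates > 1:
        if home_lane == 3:
            raise ValueError("Lane 3 has no further lane to borrow from")
        return {home_lane, 3}          # extra input or immediate eats Lane 3
    return {home_lane}

print(lanes_consumed(1))                    # {1}
print(lanes_consumed(2, reg_inputs=3))      # {2, 3}
print(lanes_consumed(1, is_simd=True))      # {1, 2, 3}
```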


  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 18 17:39:59 2026
    From Newsgroup: comp.arch

    On Sun, 17 May 2026 15:24:20 +0000, quadi wrote:

    So I need additional opcode space for 48-bit instructions.

    I managed to find enough space for the 48-bit instructions without taking
    any from elsewhere.

    However, I'm now encountering a problem with the 32-bit instructions.
    Given how I'm handling other sizes of immediates, I want all 32 registers
    to be possible destinations for the 16-bit immediates.

    This leads to an opcode space shortage for 32-bit operate instructions.
    There was a little slack in the existing 32-bit instructions that I could squeeze, but not enough.

    The amount needed, though, is 1/3 the size of what the 16-bit short instructions take, or the same as what the 24-bit short instructions take.

    Possible easy and obvious alternatives:

    1) Drop the 24-bit short instructions, they're a weird length.
    2) Go to 6-bit opcodes for the 16-bit short instructions, limiting them to
    the most important data types.
    3) Stick with only 8 (or even only 16) registers as the destination of a 16-bit immediate.

    Maybe I can squeeze more and avoid having to do any of them; if I must
    choose, (2) sounds like the most attractive, as a short instruction that
    can only work on the first 8 registers is disfavored anyways.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 18 17:53:23 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Sun, 17 May 2026 15:24:20 +0000, quadi wrote:

    So I need additional opcode space for 48-bit instructions.

    I managed to find enough space for the 48-bit instructions without taking any from elsewhere.

    However, I'm now encountering a problem with the 32-bit instructions.
    Given how I'm handling other sizes of immediates, I want all 32 registers
    to be possible destinations for the 16-bit immediates.

    This leads to an opcode space shortage for 32-bit operate instructions. There was a little slack in the existing 32-bit instructions that I could squeeze, but not enough.

    The amount needed, though, is 1/3 the size of what the 16-bit short instructions take, or the same as what the 24-bit short instructions take.

    I think you should introduce Peter to Paul.

    Possible easy and obvious alternatives:

    1) Drop the 24-bit short instructions, they're a weird length.
    2) Go to 6-bit opcodes for the 16-bit short instructions, limiting them to the most important data types.

    And only for the most used OpCodes.

    3) Stick with only 8 (or even only 16) registers as the destination of a 16-bit immediate.

    Unlikely to be compiler friendly.

    Maybe I can squeeze more and avoid having to do any of them; if I must choose, (2) sounds like the most attractive, as a short instruction that
    can only work on the first 8 registers is disfavored anyways.

    John Savard
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 18 10:57:29 2026
    From Newsgroup: comp.arch

    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to taking
    up space for the template bits. It also causes a correspondingly less efficient memory bandwidth usage. This is particularly apparent in
    EPIC, as they only get three instructions in 128 bits versus four in a traditional RISC (Although you could argue the longer instructions do
    more, but this isn't proven.).
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From BGB@cr88192@gmail.com to comp.arch on Tue May 19 00:40:46 2026
    From Newsgroup: comp.arch

    On 5/18/2026 12:57 PM, Stephen Fuld wrote:
    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to taking
    up space for the template bits.  It also causes a correspondingly less efficient memory bandwidth usage.  This is particularly apparent in
    EPIC, as they only get three instructions in 128 bits versus four in a traditional RISC (Although you could argue the longer instructions do
    more, but this isn't proven.).


    That is not the only way of encoding it.


    I ended up with a 2-bit pattern in each 32-bit instruction:
    00: ?T (Conditional)
    01: ?F (Conditional)
    10: Scalar / Final
    11: WEX (Non-Final)
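
    One way to picture how such tags delimit groups of co-issued instructions is the sketch below. It takes the 2-bit tags as already extracted (where they sit in the instruction word is not stated here), and it assumes that a conditional op ends a group the same way a Scalar/Final op does; both points are assumptions for illustration, as is the function name.

```python
TAG_NAMES = {0b00: "?T", 0b01: "?F", 0b10: "Scalar/Final", 0b11: "WEX"}

def split_bundles(tags):
    """Group instruction indices: a WEX tag keeps the group open,
    any other tag closes it (assumed for the conditional tags)."""
    bundles, current = [], []
    for index, tag in enumerate(tags):
        if tag not in TAG_NAMES:
            raise ValueError(f"not a 2-bit tag: {tag}")
        current.append(index)
        if tag != 0b11:
            bundles.append(current)
            current = []
    if current:                 # trailing non-final ops with no terminator
        bundles.append(current)
    return bundles

print(split_bundles([0b11, 0b11, 0b10, 0b10, 0b00]))
# [[0, 1, 2], [3], [4]]
```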

    There was an instruction for loading a large immediate, where:
    00/01: Conditional+WEX Sub-Block (PrWEX)
    10: Large Constant Load (with a fixed destination register)
    11: Jumbo Prefix

    Though, for XG3:
    11 was reused for 32-bit RISC-V instructions.
    The large constant-load instruction was replaced with the Jumbo Prefix case;
    PrWEX: Currently Unused / Reserved, experimentally used for Pair-Packed
    instructions, but these failed the "effective enough to justify
    existing" test.


    Was tempted to consider dropping conditional ops, but I ended up
    deciding to keep them. They are a big chunk of potential encoding space,
    but they can help with performance (and the approach taken in Mitch's
    ISA has some of its own drawbacks which mine avoids, even if, yes, it
    does eat a whole bit of entropy).


    So, XG3 loses the option of explicit LIW/VLIW, but gains superscalar,
    which I had already added for RISC-V. I was half tempted to consider a superscalar mode for XG2, but it is unclear if XG1 and XG2 will have a long-term future.



    There was a possibility of re-adding a compacted form of RV-C into the
    space used by the predicated ops, but I decided against this.

    Well, and as noted, even without either 16-bit ops or pair-packing, XG3
    has still managed to become competitive with RV64GC+Jx in terms of code density (where RV64GC+Jx also beats plain RV64GC as well). Well, and
    seemingly also BGBCC is beating GCC for code-density with plain RV64G
    and RV64GC, though in the basic/standardized ISA GCC's output is still
    faster (though not drastically).

    Weird though, one would almost expect GCC to be like some unbeatable
    titan... Not like something that can be roughly matched by a compiler
    which still fails to have a non brain-dead register allocator.

    ...


  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 19 11:22:32 2026
    From Newsgroup: comp.arch

    On 5/18/2026 10:40 PM, BGB wrote:
    On 5/18/2026 12:57 PM, Stephen Fuld wrote:
    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to
    taking up space for the template bits.  It also causes a
    correspondingly less efficient memory bandwidth usage.  This is
    particularly apparent in EPIC, as they only get three instructions in
    128 bits versus four in a traditional RISC (Although you could argue
    the longer instructions do more, but this isn't proven.).


    That is not the only way of encoding it.

    Of course. I was merely citing a particularly bad example.


    I ended up with a 2-bit pattern in each 32-bit instruction:
      00: ?T (Conditional)
      01: ?F (Conditional)
      10: Scalar / Final
      11: WEX (Non-Final)


    So you have reduced the loss of memory efficiency to about 6% (4/32). Certainly smaller, but still there. BTW, I don't understand the
    conditionals. On what basis do they make their decision? Are they part
    of some predication scheme?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 19 11:47:29 2026
    From Newsgroup: comp.arch

    On 5/19/2026 11:22 AM, Stephen Fuld wrote:
    On 5/18/2026 10:40 PM, BGB wrote:
    On 5/18/2026 12:57 PM, Stephen Fuld wrote:
    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to
    taking up space for the template bits.  It also causes a
    correspondingly less efficient memory bandwidth usage.  This is
    particularly apparent in EPIC, as they only get three instructions in
    128 bits versus four in a traditional RISC (Although you could argue
    the longer instructions do more, but this isn't proven.).


    That is not the only way of encoding it.

    Of course.  I was merely citing a particularly bad example.


    I ended up with a 2-bit pattern in each 32-bit instruction:
       00: ?T (Conditional)
       01: ?F (Conditional)
       10: Scalar / Final
       11: WEX (Non-Final)


    So you have reduced the loss of memory efficiency to about 6% (4/32).

    Sorry (2/32)

    Certainly smaller, but still there.  BTW, I don't understand the conditionals.  On what basis do they make their decision? Are they part
    of some predication scheme?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From BGB@cr88192@gmail.com to comp.arch on Wed May 20 00:52:25 2026
    From Newsgroup: comp.arch

    On 5/19/2026 1:22 PM, Stephen Fuld wrote:
    On 5/18/2026 10:40 PM, BGB wrote:
    On 5/18/2026 12:57 PM, Stephen Fuld wrote:
    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to
    taking up space for the template bits.  It also causes a
    correspondingly less efficient memory bandwidth usage.  This is
    particularly apparent in EPIC, as they only get three instructions in
    128 bits versus four in a traditional RISC (Although you could argue
    the longer instructions do more, but this isn't proven.).


    That is not the only way of encoding it.

    Of course.  I was merely citing a particularly bad example.


    I ended up with a 2-bit pattern in each 32-bit instruction:
       00: ?T (Conditional)
       01: ?F (Conditional)
       10: Scalar / Final
       11: WEX (Non-Final)


    So you have reduced the loss of memory efficiency to about 6% (4/32). Certainly smaller, but still there.  BTW, I don't understand the conditionals.  On what basis do they make their decision? Are they part
    of some predication scheme?


    It cost 2 bits from a 32-bit instruction, granted.
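
    For reference, the figures being compared work out as below. This is just the arithmetic from the thread (2 tag bits per 32-bit instruction, and three versus four instructions per 128 bits); the helper names are made up for illustration.

```python
from fractions import Fraction

def tag_overhead(tag_bits: int, instr_bits: int) -> Fraction:
    """Share of each instruction spent on the parallelism/predicate tag."""
    return Fraction(tag_bits, instr_bits)

def relative_density(instrs_a: int, instrs_b: int) -> Fraction:
    """Instructions per block for encoding A relative to encoding B."""
    return Fraction(instrs_a, instrs_b)

print(float(tag_overhead(2, 32)))   # 0.0625
print(relative_density(3, 4))       # 3/4
```

    So 2/32 is 6.25%, matching the "about 6%" estimate, while three instructions per 128 bits is three quarters of the instruction count of a plain 32-bit encoding.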


    The conditionals depended on the status of a single status bit:
    ?T: Execute if SR.T is Set, Typically No-Op if Clear (*1)
    ?F: Execute if SR.T is Clear, Typically No-Op if Set

    *1: Except for Branches, which also interact with the branch predictor.
    These turn into a "BRANOP" case, which is special:
    BRA: Take Branch
    If predicted Taken: Do Nothing (pipeline rides to new address)
    If Predicted Not Taken: Initiate a Branch
    BRANOP: Don't Take Branch
    If Predicted Taken: Initiate branch to following instruction;
    By this point, IF/ID having already gone the wrong path.
    If Predicted Not-Taken: Do Nothing (NOP).
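    The rules above can be sketched in Python. This is a behavioural model only, not the actual core logic: the enum values mirror the 2-bit pattern listed earlier, while the names Pred, executes and branch_action (and the returned strings) are hypothetical.

```python
from enum import Enum

class Pred(Enum):
    IF_TRUE = 0b00    # ?T
    IF_FALSE = 0b01   # ?F
    SCALAR = 0b10     # Scalar / Final, unconditional
    WEX = 0b11        # WEX (Non-Final), unconditional

def executes(pred: Pred, sr_t: bool) -> bool:
    """Return True if an instruction with this pattern takes effect."""
    if pred is Pred.IF_TRUE:
        return sr_t
    if pred is Pred.IF_FALSE:
        return not sr_t
    return True

def branch_action(pred: Pred, sr_t: bool, predicted_taken: bool) -> str:
    """What a (possibly predicated) branch does once SR.T is known."""
    if executes(pred, sr_t):
        # BRA case: the branch is really taken.
        return "none" if predicted_taken else "branch_to_target"
    # BRANOP case: the branch is suppressed; if it was predicted taken,
    # fetch already went the wrong way and must be redirected.
    return "branch_to_next" if predicted_taken else "none"
```

    For a non-branch instruction only executes() matters; branch_action() covers the four BRA/BRANOP cases in the footnote.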


    The SR.T bit is only modified by certain instructions.
    XG1/XG2: There were 2R and 2RI Compare Instructions.
    CMPEQ Rm, Rn //Set SR.T if Rm==Rn, Clear otherwise.
    CMPEQ Imm10un, Rn //Similar, but with an immediate.
    In XG3, these were eliminated in favor of 3R encodings.
    In the 3R encodings, If Rn/Rd is ZERO, it modifies SR.T.
    CMPEQ Rs, Rt, R0 //Do the same as the 2R case did.
    CMPEQ Rs, Imm6s, R0 //Do the same as the 2RI case

    Though, the 3R Compare encodings have a smaller immediate.
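    A small model of the 3R compare behaviour described above. It assumes register index 0 is the zero register; what happens with a non-zero destination is not stated in the post, so that branch is a guess and is marked as such. The function name and the dictionary used for SR are illustrative only.

```python
def cmpeq_3r(regs, sr, rs, rt, rd):
    """CMPEQ Rs, Rt, Rd in the 3R form (illustrative model only)."""
    equal = regs[rs] == regs[rt]
    if rd == 0:
        # Destination is the zero register: update the predicate bit.
        sr["T"] = equal
    else:
        # Assumption, not from the post: a non-zero destination
        # receives 0 or 1 and SR.T is left untouched.
        regs[rd] = int(equal)

regs = [0] * 64          # 6-bit register fields give 64 registers
regs[4] = 7
regs[5] = 7
sr = {"T": False}
cmpeq_3r(regs, sr, 4, 5, 0)   # same effect as the old 2R CMPEQ
```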



    These differ (as I understand it) from the predication in Mitch's ISA,
    which marks the next N instructions for whether or not to execute.
    This approach uses less encoding entropy, but:
    Slightly less flexible;
    Handling an exception within an instruction group would introduce
    additional required architectural state (keeping track of the execute/no-execute status for the following instructions).


    In my case, it is vaguely similar to ARM32 here, except less bits and
    cheaper to handle because no full condition codes.

    While only a single status bit could seem limiting, had noted that
    usually one only needs to predicate a single branch at a time (typically
    the innermost "if()" level).


    Explicit predication also allows the compiler to reshuffle the then/else
    branches together to better fit into the pipeline (though, generally the
    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still
    apply even if only the "actually executed" instructions will visibly
    manifest results at the end). So, they are like ghost instructions which
    still function as-if they were real instructions, but outputs are
    suppressed along with any other side-effects.

    While arguably it might be better if the non-executed instructions could disappear as-if they never existed in the first place (not consuming
    clock cycles or pipeline lanes), this isn't really possible with the
    existing pipeline.

    But, either way, predicating a short if/else branch is faster on average
    than having a conditional branch to skip over it (and by the time the
    limits of the predication scheme become more of an issue, it is
    typically already time to switch over to using a conditional branch).

    ...


    But, yeah, as can be noted:
    2 bit predication + 6-bit register fields does limit the amount of
    opcode space.

    As can be noted:
    F0 Block:
    The full block would contain 512 unique 3R instructions.
    1/4 of this space was originally cut off for BRA/BSR.
    So, leaving 384 total 3R ops,
    but with parts also cut off for 2R spaces.
    The original plan was to migrate this to F8,
    but this would have made other/bigger problems.
    Currently, there are around 24 3R spots used for 2R ops.
    As utilized (in XG1), but as used:
    Original Scheme: 384 ops.
    More ops via EI bit: 768 ops;
    XG3 Limit: 1536 2R ops.
    In my case, I used 2R ops more often than RISC-V does.
    F1 Block:
    Space for 32 Disp10 ops;
    There were 24 spots for Load/Store;
    Then, 8 spots used for Compare-and-Branch.
    F2 Block:
    Also a max of 32 Imm10 ops;
    Though
    5/8 (20 spots): 3RI Imm10 Spots
    So, 20 unique Imm10 spots.
    SHAD/SHLD spots got subdivided into Imm8.
    3/8: 2RI Imm10 Spots (Unused/Disallowed in XG3)
    Loss of WEX and adding Zero-Register made them all irrelevant.
    F8 Block:
    Space for 32 2RI-Imm16 ops.
    In XG1 was 8 spots, but expanded in XG3.

    The F3 and F9 blocks are not used yet, but are both planned to be
    F0-style blocks.
    Both could in theory contain another 512 3R instructions.
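    The F0 numbers above can be reproduced with a little arithmetic. The first part follows directly from the text (512 spots, a quarter cut off). The second part is only one possible reading: it assumes each 3R spot reused for 2R ops turns the freed register field into a sub-opcode of 4, 5 or 6 bits, which happens to give 384, 768 and 1536. The names are hypothetical.

```python
F0_TOTAL_3R = 512                          # full F0 block
f0_after_branches = F0_TOTAL_3R * 3 // 4   # 1/4 cut off for BRA/BSR

SPOTS_FOR_2R = 24                          # 3R spots given over to 2R ops

def two_r_capacity(sub_opcode_bits: int) -> int:
    # Assumed reading: the unused register field selects the 2R op.
    return SPOTS_FOR_2R * (1 << sub_opcode_bits)

print(f0_after_branches)    # 384
print(two_r_capacity(4))    # 384
print(two_r_capacity(5))    # 768
print(two_r_capacity(6))    # 1536
```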


    As noted, I did need to be more conservative with the use of 3R and
    Imm10 ops vs RISC-V (in theory, one could fit considerably more
    3-register ops into RISC-V's 32-bit encoding space assuming it is not
    needlessly squandered).

    Though, some cases can be expanded to 64-bit encodings, for example:
    ADC Rs, Rt, Rn //Add-with-Carry
    Being a 64-bit encoding (mostly because ADC ended up not used enough to justify spending a 32-bit 3R encoding on it).


    A few niche ops could arguably make sense to be demoted to 64-bit
    encodings, say:
    LDTEX, BLKUTX*, BLKUAB*, BLINT/BLERP, ...
    Mostly because, if an instruction is only used a handful of times in a
    program (if at all) it is harder to justify a more compact encoding
    (but, may still make sense to be fast, if it is used in a tight loop).

    So, yeah, a case of instructions:
    Used rarely in terms of static usage counts;
    If used at all, often it is in the middle of a performance-sensitive
    loop (hence why they exist at all; like if rasterizing something you
    want texel fetch and interpolation to be fast, ...).




    While, in XG3, it would look like one could sneak some Imm10 ops into
    the F8 or F9 blocks, because of the way the decoder is currently
    implemented, this would result in "unholy dog chew" in the decoder (XG3
    didn't get a new decoder, rather it works on top of the existing XG1/XG2
    decoders via some level of "nightmare mode hackery").

    Well, partly as a consequence of making design choices intended to make
    the encoding scheme less dog chewed than had I simply gone with the least-effort options.


    There was some temptation to reduce the space used by ADDS.L and ADDU.L Imm10un encodings, as these use 4 spots as-is, but are not used often
    enough to justify this cost:
    ADDS.L could be reduced to 1 spot;
    Reducing to Imm6s or Imm6un would impact code-density though.
    ADDU.L could be reduced to Imm6s or Imm6un.
    Used infrequently enough that it wouldn't strongly affect density.

    However, changing this as-is would make a backwards compatibility mess.
    Likewise for changing MULS.L to Imm10s (currently Imm10u).
    ...

    I don't really like changing things that will break existing binaries.


    Though, strictly speaking, XOR Imm10 would be in a similar chopping
    block to ADDU.L, as both are in a similar weight-class in terms of usage-frequency and constant-distribution.



    I guess, it does exist as a thought that if I later did a core specific
    to my "BJX3" idea spec, if I should keep separate decoders for RV64 and
    XG3 ops, or try to merge them into a single super-ISA (with RV-C being
    handled as the outlier).

    Though, may make sense to have separate decoder modules as both ISAs
    also have different register fields and different immediate-field
    layouts, so there is little that could be meaningfully shared given my
    general approach to register decoders (which mostly works best with consolidating layouts and avoiding "one-offs" wherever possible).

    ...


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 20 18:25:14 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/19/2026 1:22 PM, Stephen Fuld wrote:
    On 5/18/2026 10:40 PM, BGB wrote:
    On 5/18/2026 12:57 PM, Stephen Fuld wrote:
    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to
    taking up space for the template bits.  It also causes a
    correspondingly less efficient memory bandwidth usage.  This is
    particularly apparent in EPIC, as they only get three instructions in
    128 bits versus four in a traditional RISC (Although you could argue
    the longer instructions do more, but this isn't proven.).


    That is not the only way of encoding it.

    Of course.  I was merely citing a particularly bad example.


    I ended up with a 2-bit pattern in each 32-bit instruction:
       00: ?T (Conditional)
       01: ?F (Conditional)
       10: Scalar / Final
       11: WEX (Non-Final)


    So you have reduced the loss of memory efficiency to about 6% (4/32). Certainly smaller, but still there.  BTW, I don't understand the conditionals.  On what basis do they make their decision? Are they part of some predication scheme?


    It cost 2 bits from a 32-bit instruction, granted.

    If you still have Scalar and WEX it only cost 1-bit.


    The conditionals depended on the status of a single status bit:
    ?T: Execute if SR.T is Set, Typically No-Op if Clear (*1)
    ?F: Execute if SR.T is Clear, Typically No-Op if Set

    It also costs the flags as bits.

    I should note: Predication in My 66000 costs the predicated instruction
    0-bits.
    ------------------
    Explicit predication also allows the compiler to reshuffle the then/else branches together to better fit into the pipeline (though, generally the

    We taught the LLVM compiler to always place the then-clause first and the else-clause second without finding a compiler problem.

    The HW does not switch back and forth between clauses, making register
    tracking in GBOoO fairly easy.

    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still

    I found this unnecessary

    apply even if only the "actually executed" instructions will visibly manifest results at the end). So, they are like ghost instructions which still function as-if they were real instructions, but outputs are
    suppressed along with any other side-effects.

    While arguably it might be better if the non-executed instructions could disappear as-if they never existed in the first place (not consuming
    clock cycles or pipeline lanes), this isn't really possible with the existing pipeline.

    Your problem, yet you see it as an advantage ?!?!

    But, either way, predicating a short if/else branch is faster on average than having a conditional branch to skip over it (and by the time the
    limits of the predication scheme become more of an issue, it is
    typically already time to switch over to using a conditional branch).

    Predication works when the FETCH unit gets to the join point before
    DECODE gets to the else-clause. When fetching 128-bits/cycle in a 1
    wide machine, this is at least 8 instructions (each clause). Wider
    machines will have correspondingly wider FETCH.
  • From BGB@cr88192@gmail.com to comp.arch on Wed May 20 16:55:50 2026
    From Newsgroup: comp.arch

    On 5/20/2026 1:25 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/19/2026 1:22 PM, Stephen Fuld wrote:
    On 5/18/2026 10:40 PM, BGB wrote:
    On 5/18/2026 12:57 PM, Stephen Fuld wrote:
    On 5/17/2026 1:37 PM, BGB wrote:

    snip

    It is more VLIW vs In-Order, and In-Order vs OoO.

    VLIW vs In-Order:
        VLIW:
          + Slightly cheaper logic;
          - Binary depends more on processor specifics.
        In-Order:
          - Needs logic to handle register deps and lookup opcode flags.
          + Code does not depend on uArch.

    Another disadvantage is less efficient memory utilization due to
    taking up space for the template bits.  It also causes a
    correspondingly less efficient memory bandwidth usage.  This is
    particularly apparent in EPIC, as they only get three instructions in
    128 bits versus four in a traditional RISC (Although you could argue
    the longer instructions do more, but this isn't proven.).


    That is not the only way of encoding it.

    Of course.  I was merely citing a particularly bad example.


    I ended up with a 2-bit pattern in each 32-bit instruction:
       00: ?T (Conditional)
       01: ?F (Conditional)
       10: Scalar / Final
       11: WEX (Non-Final)


    So you have reduced the loss of memory efficiency to about 6% (4/32).
    Certainly smaller, but still there.  BTW, I don't understand the
    conditionals.  On what basis do they make their decision? Are they part
    of some predication scheme?


    It cost 2 bits from a 32-bit instruction, granted.

    If you still have Scalar and WEX it only cost 1-bit.


    Well, Scalar and WEX cost a bit, granted WEX no longer exists in XG3 (as
    this part of the encoding space now maps to the RISC-V instructions,
    with these two bits having moved into the low 2 bits of the 32-bit
    instruction word).

    The other bit selects between the Scalar/WEX cases and the Predicated instructions.

    So, effectively:
    It goes from nearly the whole ISA repeating 4 times, to 3 repeats, with
    RV64G thrown in the final spot.



    The conditionals depended on the status of a single status bit:
    ?T: Execute if SR.T is Set, Typically No-Op if Clear (*1)
    ?F: Execute if SR.T is Clear, Typically No-Op if Set

    It also costs the flags as bits.

    I should note: Predication in My 66000 costs the predicated instruction 0-bits.
    ------------------
    Explicit predication also allows the compiler to reshuffle the then/else
    branches together to better fit into the pipeline (though, generally the

    We taught the LLVM compiler to always place the then-clause first and the else-clause second without finding a compiler problem.

    The HW does not switch back and forth between clauses, making register tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle instructions to pack into the pipeline more efficiently.

    Can note that ?T and ?F instructions remain ordered relative to
    unconditional instructions, but ?T and ?F do not maintain ordering
    relative to each other (even if the scheduling still needs to optimize
    for RAW dependencies, as the CPU will still end up stalling to wait for
    the results and inputs for these non-executed instructions...).


    For sake of the compiler's handling, it basically treats SR.T as a
    special virtual register for sake of the shuffling:
    The Compare uses SR.T as its virtual output register;
    All of the predicated instructions use SR.T as an input.

    So, the shuffling behaves much as if the predication were an additional register dependency.
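    A sketch of how a scheduler might model this, written in Python rather than taken from BGBCC. The Op class, must_keep_order and the register names are hypothetical. It treats SR.T as one more register, and lets ?T and ?F ops pass each other since only one of the two can take effect; it covers correctness only, not the stall costs mentioned above.

```python
from dataclasses import dataclass

SR_T = "SR.T"  # the predicate bit, modelled as a virtual register

@dataclass(frozen=True)
class Op:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    pred: str = ""  # "" = unconditional, "T" = ?T, "F" = ?F

def effective_reads(op: Op) -> frozenset:
    # A predicated op consumes SR.T in addition to its named sources.
    return (op.reads | {SR_T}) if op.pred else op.reads

def must_keep_order(first: Op, second: Op) -> bool:
    """True if `second` may not be moved ahead of `first`."""
    opposite = {first.pred, second.pred} == {"T", "F"}
    if opposite and SR_T not in (first.writes | second.writes):
        # ?T and ?F never both take effect, so they are unordered.
        return False
    raw = first.writes & effective_reads(second)
    war = effective_reads(first) & second.writes
    waw = first.writes & second.writes
    return bool(raw or war or waw)

cmp_op = Op("CMPEQ", frozenset({"R4", "R5"}), frozenset({SR_T}))
then_op = Op("ADD?T", frozenset({"R6"}), frozenset({"R7"}), "T")
else_op = Op("SUB?F", frozenset({"R6"}), frozenset({"R7"}), "F")
use_op = Op("MOV", frozenset({"R7"}), frozenset({"R8"}))
```

    Here must_keep_order(cmp_op, then_op) holds because of the SR.T dependency, while then_op and else_op may be interleaved freely.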


    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still

    I found this unnecessary


    Well, in my case it is a minor drawback:
    One doesn't know whether the instruction is EX or No-EX until the EX
    stages, but by this point the pipeline flow is essentially already
    locked-in (to some extent, it is already locked in by the IF stage, but
    IF isn't going to know yet whether or not these instructions will
    actually run).


    It is also partly related to a bug I had found early on in the
    development of my core:
    Branches were sometimes failing, and it turned out to be that a register dependency from an instruction after a branch was depending on results
    from before the branch and triggering a pipeline stall, which was
    throwing off the branch mechanism.

    Ended up adding a special case handling that if a branch is in-progress,
    the mechanism to stall the pipeline and insert a bubble is temporarily disabled (well, and that taking a branch retroactively disables
    instructions which followed the branch instruction). Well, and also that
    if a fetched instruction is in the shadow of a branch, it temporarily
    disables PC increment (needed partly as inter-mode branches may need a
    few extra cycles for the ISA mode to fully transition and/or for the I$
    to realize it needs to flush the cache-lines which represent a stale mode).


    apply even if only the "actually executed" instructions will visibly
    manifest results at the end). So, they are like ghost instructions which
    still function as-if they were real instructions, but outputs are
    suppressed along with any other side-effects.

    While arguably it might be better if the non-executed instructions could
    disappear as-if they never existed in the first place (not consuming
    clock cycles or pipeline lanes), this isn't really possible with the
    existing pipeline.

    Your problem, yet you see it as an advantage ?!?!


    Well, not this part...

    This part is an annoyance, but no obvious way to make everything behave
    as if it costs 0 cycles (or, less cycles than the "normal case").



    But, as noted, for the common case where the if/else branches are only a
    few instructions, waiting for N instructions that execute but results
    are ignored, is still faster than branching over them.
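    A toy cost model for this trade-off, with every number hypothetical. It assumes a single-issue in-order pipeline at one instruction per cycle, ignores the cost of the branch instruction itself, and charges a fixed flush penalty on a mispredict. It is not a measurement of any real core.

```python
def predicated_cycles(then_len, else_len):
    # Both arms flow down the pipeline; the arm whose predicate is
    # false still occupies its slots, only its results are suppressed.
    return then_len + else_len

def branched_cycles(then_len, else_len, p_then, p_mispredict, penalty):
    # Only one arm runs, but a mispredicted branch pays a flush penalty.
    executed = p_then * then_len + (1.0 - p_then) * else_len
    return executed + p_mispredict * penalty

# Hypothetical case: a 3-instruction "then" with no "else", taken half
# the time, mispredicted a quarter of the time, 8-cycle flush.
short_if_pred = predicated_cycles(3, 0)
short_if_branch = branched_cycles(3, 0, 0.5, 0.25, 8)
print(short_if_pred, short_if_branch)
```

    Under these made-up numbers the short case favours predication, and long arms flip the result, which matches the point about switching back to a conditional branch once the arms grow.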

    Masking off the next N instructions, if synced with fetch, could allow
    special-casing the fetch to go wider. Theoretically, could also do
    this with the SR.T predication if one could verify that nothing in the
    pipeline could modify SR.T. Though luckily at least, this is one of the
    flags that the Superscalar fetch takes note of, but would need a good
    way to flow it forward into the following stages (well, except RF and
    EX, where this is already known via other means).

    Say:
    CMPxx Rs, Rt, X0
    OP1?T ...
    OP2?T ...
    OP3?T ...
    ...

    Would need to detect the CMP in the ID stage so as to disable
    wide-elimination in IF.

    But... This limitation would also eliminate one of the primary use-cases
    of predication, as (very often) it exists as a short run of instructions
    directly following the compare (and, by the time IF could take advantage
    of known SR.T, most or all of the predicated instructions would have
    already completed).

    Well, or the compiler can reorder more, say:
    if(cond)
    then_stuff
    unrelated_stuff

    Becomes:
    CMPxx ...
    unrelated_stuff
    ...
    then_stuff?T
    ...
    So that the CMPxx result could reach IF in time for IF to do anything
    useful with it (then maybe have the compiler consider CMPxx and similar
    as having a 4 or 5-cycle latency WRT SR.T).



    Well, similar reason to why the compiler needs to reload the link
    register first:
    Then the RTS / JALR / etc can do something useful with it in the branch predictor.
    But, if you load then immediately JALR, the CPU is stuck and needs to go
    the slow path.


    Hmm...


    But, I am not sure how Mask-N would avoid this issue:
    Presumably the result of the 'PRED' instruction would also not be known
    until the instructions following PRED were already in the pipeline (and
    so would still need to handle the instructions as-if they were being
    executed normally until they reach the EX stages or similar).

    Well, and presumably there is no way to know the result of the PRED
    until after it goes through EX or similar.


    But, Mask-N does mean you can't really mix the then/else branches
    together, or mix them with instructions from the code following the
    then/else blocks.


    But, either way, predicating a short if/else branch is faster on average
    than having a conditional branch to skip over it (and by the time the
    limits of the predication scheme become more of an issue, it is
    typically already time to switch over to using a conditional branch).

    Predication works when the FETCH unit gets to the join point before
    DECODE gets to the else-clause. When fetching 128-bits/cycle in a 1
    wide machine, this is at least 8 instructions (each clause). Wider
    machines will have correspondingly wider FETCH.

    OK.

    In my case, the predication magic mostly happens in the EX stages.

    As noted, they mostly pass along as normal, just the side-effecting
    parts are masked away on entering EX1.

    Not sure how it worked on ARM32, probably something similar...


    In my case, most of the then/else branches seem to be 3-5 instructions
    or similar.


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 20 22:51:28 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making register tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle instructions to pack into the pipeline more efficiently.

    We do not do instruction scheduling, we let the instruction queues do
    that kind of stuff. You must be caught in 1988 ...
    --------------------
    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still

    I found this unnecessary


    Well, in my case it is a minor drawback:
    One doesn't know whether the instruction is EX or No-EX until the EX
    stages, but by this point the pipeline flow is essentially already
    locked-in (to some extent, it is already locked in by the IF stage, but
    IF isn't going to know yet whether or not these instructions will
    actually run).

    It's a simple problem for Reservation Stations to handle.
    --------------
    Your problem, yet you see it as an advantage ?!?!


    Well, not this part...

    This part is an annoyance, but no obvious way to make everything behave
    as if it costs 0 cycles (or, less cycles than the "normal case").

    Key word "obvious" but then again you spell both load and store as MOV.

  • From BGB@cr88192@gmail.com to comp.arch on Wed May 20 23:45:39 2026
    From Newsgroup: comp.arch

    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making register
    tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle
    instructions to pack into the pipeline more efficiently.

    We do not do instruction scheduling, we let the instruction queues do
    that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance than
    just leaving them in whatever order they come out of the main codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until the
    late 1990s (eg, Pentium II and Pentium III).



    As for me, in 1988 I would still have been very young. I think these
    being mostly the "sit around and watch cartoons" years (with my "K12"
    years mostly spanning the 1990s and early 2000s).

    But, I am getting on in years, having existed for over 4 decades now...


    --------------------
    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still

    I found this unnecessary


    Well, in my case it is a minor drawback:
    One doesn't know whether the instruction is EX or No-EX until the EX
    stages, but by this point the pipeline flow is essentially already
    locked-in (to some extent, it is already locked in by the IF stage, but
    IF isn't going to know yet whether or not these instructions will
    actually run).

    It's a simple problem for Reservation Stations to handle.


    Once again, CPU is in-order.



    --------------
    Your problem, yet you see it as an advantage ?!?!


    Well, not this part...

    This part is an annoyance, but no obvious way to make everything behave
    as if it costs 0 cycles (or, less cycles than the "normal case").

    Key word "obvious" but then again you spell both load and store as MOV.


    This is an assembler design choice...

    But, then again:
    MOV EAX, DWORD PTR [EDX] //Intel style
    MOV.L 0(%EDX), %EAX //AT&T / GAS style
    MOV.L 0(%A3), %D0 //M68K style
    MOV.L @R3, R2 //SuperH style
    MOV.L (R3), R2 //style I went with
    ...
    LW X10, 0(X13) //RISC-V style

    A lot of targets still use MOV...


    The assembler in BGBCC also accepts RISC-V notation (and I almost
    considered switching over), but I mostly ended up sticking with the
    former style syntax due to inertia (and there is a "non-zero friction
    cost" related to which ASM syntax one uses).


  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu May 21 16:05:10 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making register
    tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle
    instructions to pack into the pipeline more efficiently.

    We do not do instruction scheduling, we let the instruction queues do
    that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance than
    just leaving them in whatever order they come out of the main codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until the
    late 1990s (eg, Pentium II and Pentium III).

    Please check your history!

    The PentiumPro was the first ever mass-market OoO CPU, it arrived around
    1996.

    PentiumI/III/MMX were just variations on the original Pentium which
    introduced superscalar in the form of the u and v pipes which could
    execute two instructions at once, IFF you aligned them properly and
    selected a simple instruction for the v pipe.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu May 21 19:45:47 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 16:05:10 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    BGB wrote:
    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making
    register tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to
    reshuffle instructions to pack into the pipeline more
    efficiently.

    We do not do instruction scheduling, we let the instruction queues
    do that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance
    than just leaving them in whatever order they come out of the main
    codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until
    the late 1990s (eg, Pentium II and Pentium III).

    Please check your history!

    The PentiumPro was the first ever mass-market OoO CPU, it arrived
    around 1996.

    PentiumI/III/MMX were just variations on the original Pentium which
    introduced superscalar in the form of the u and v pipes which could
    execute two instructions at once,

    Pentium-MMX (P55C) is indeed a variant of Pentium, with not insignificant
    microarchitectural changes (a pipeline one stage longer, slightly relaxed
    pairing rules, a different decoder that does not depend on instruction
    boundary marks in the cache).
    But Pentium III is P6, the next generation after Pentium II with an
    almost identical core uArch.

    IFF you aligned them properly and
    selected a simple instruction for the v pipe.

    Terje


    P6 had its own problems, not with execution phase but with decoders.
    Google for 4-1-1.






  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Wed May 20 17:56:16 2026
    From Newsgroup: comp.arch

    On 5/16/26 2:23 PM, MitchAlsup wrote:
    [snip]
    When you do not have condition codes, and only 1 register file, you
    can determine parallel-ness by simply looking at the registers.

    Not having different instructions produce results for different
    parts of a single condition code also helps. Using a fixed
    instruction bit to indicate setting of the condition code and
    another bit to indicate consumption would also make dependency
    detection easier at the cost of code density (and other issues
    with single-size instructions). Even a random-read, circular-
    buffer-write condition code queue (where the write is implicit
    and the read is explicit) would have easier dependency checking
    than checking for dependencies for x86 condition codes.

    Even with a single register set and fixed register name
    positions in instructions, there would be some need to decode
    the instruction enough to determine how many sources are used
    and if a destination is used.

    I got the impression that the comparison overhead for wider
    dependency checks was "intimidating" for early processors. Two
    wide with only two-source instructions only requires two
    comparisons, but four wide requires 12 comparisons.
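    The counts quoted above follow if each instruction in an issue group compares each of its sources against the destination of every older instruction in the same group. A small sketch of that counting rule (the function name and the counting model are assumptions for illustration, not something spelled out in the post):

```python
def dependency_comparators(width: int, sources_per_instr: int = 2) -> int:
    # Instruction i (0-based) has i older instructions in the group and
    # checks each of its sources against each of their destinations:
    # sources * (0 + 1 + ... + (width - 1)).
    return sources_per_instr * width * (width - 1) // 2

for width in (2, 4, 8):
    print(width, dependency_comparators(width))
```

    The growth is quadratic in issue width, which fits the remark about wider checks being intimidating for early processors.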

    Of course, all of the above is fairly basic computer
    architecture theory, probably known to all the posters here and
    almost all of the readers. I guess I am feeling a bit cranky to
    post common knowledge to refine a general statement about
    condition codes.

    (I do feel that such comparison work and any deduplication and
    RAT bank checking could be cached either at L1 or in a more
    persistent layer, perhaps even the in-RAM executable format.)

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 21 18:30:33 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making register
    tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle
    instructions to pack into the pipeline more efficiently.

    We do not do instruction scheduling, we let the instruction queues do
    that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance than
    just leaving them in whatever order they come out of the main codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until the
    late 1990s (eg, Pentium II and Pentium III).



    As for me, in 1988 I would still have been very young. I think these
    being mostly the "sit around and watch cartoons" years (with my "K12"
    years mostly spanning the 1990s and early 2000s).

    But, I am getting on in years, having existed for over 4 decades now...


    --------------------
    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still

    I found this unnecessary


    Well, in my case it is a minor drawback:
    One doesn't know whether the instruction is EX or No-EX until the EX
    stages, but by this point the pipeline flow is essentially already
    locked-in (to some extent, it is already locked in by the IF stage, but
    IF isn't going to know yet whether or not these instructions will
    actually run).

    It's a simple problem for Reservation Stations to handle.


    Once again, CPU is in-order.



    --------------
    Your problem, yet you see it as an advantage ?!?!


    Well, not this part...

    This part is an annoyance, but no obvious way to make everything behave
    as if it costs 0 cycles (or, less cycles than the "normal case").

    Key word "obvious" but then again you spell both load and store as MOV.


    This is an assembler design choice...

    But, then again:
    MOV EAX, DWORD PTR [EDX] //Intel style
    MOV.L 0(%EDX), %EAX //AT&T / GAS style
    MOV.L 0(%A3), %D0 //M68K style
    MOV.L @R3, R2 //SuperH style
    MOV.L (R3), R2 //style I went with
    ...
    LW X10, 0(X13) //RISC-V style
    // style BGB inherited

    A lot of targets still use MOV...

    MOV implies that the data is unaltered, while LD implies the memory
    value is expanded to fill out the whole register, and ST implies
    the register value is chopped off to fit in the memory container.
    Expansion means sign or zero extend, chop means bits are ignored.
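
    As a rough illustration of the LD/ST distinction described above, here
    is a minimal Python sketch. It assumes a 64-bit register, and the helper
    names are hypothetical (they are not taken from any ISA discussed in this
    thread).

```python
MASK64 = (1 << 64) - 1  # assumed register width: 64 bits


def load_zero_extend(mem_value, bits):
    """LD, unsigned: the memory value is zero-extended to fill the register."""
    return mem_value & ((1 << bits) - 1)


def load_sign_extend(mem_value, bits):
    """LD, signed: the memory value is sign-extended to fill the register."""
    value = mem_value & ((1 << bits) - 1)
    sign = 1 << (bits - 1)
    # XOR/subtract trick: propagates the container's top bit upward.
    return ((value ^ sign) - sign) & MASK64


def store_truncate(reg_value, bits):
    """ST: upper register bits are ignored so the value fits the container."""
    return reg_value & ((1 << bits) - 1)
```

    For example, load_sign_extend(0x80000000, 32) fills the upper 32 bits
    with ones, while store_truncate(0x123456789, 32) keeps only 0x23456789.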


    The assembler in BGBCC also accepts RISC-V notation (and I almost
    considered switching over), but I mostly ended up sticking with the
    former style syntax due to inertia (and there is a "non-zero friction
    cost" related to which ASM syntax one uses).


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu May 21 13:57:14 2026
    From Newsgroup: comp.arch

    On 5/21/2026 9:05 AM, Terje Mathisen wrote:
    BGB wrote:
    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making register
    tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle
    instructions to pack into the pipeline more efficiently.

    We do not do instruction scheduling, we let the instruction queues do
    that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance
    than just leaving them in whatever order they come out of the main
    codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until
    the late 1990s (eg, Pentium II and Pentium III).

    Please check your history!

    The PentiumPro was the first ever mass-market OoO CPU, it arrived around 1996.

    PentiumI/III/MMX were just variations on the original Pentium, which
    introduced superscalar in the form of the u and v pipes which could
    execute two instructions at once, IFF you aligned them properly and
    selected a simple instruction for the v pipe.


    OK.

    Either way, basic point still stands:
    1996 is still much later than 1988...


    Though, looking stuff up:
      Seems PentiumPro was intended for the workstation market;
        Its market share was (AFAIK) not as big as the Pentium II or III.
      Whereas Pentium II was a more consumer marketed chip,
        and widely sold.

    Stuff I am reading says that PentiumPro, Pentium II, and Pentium III
    were all based on the P6 architecture.

    In contrast to the Pentium I and Pentium MMX, which were P5.
    Or Pentium 4, which was NetBurst.

    Then Intel later brought back a P6 variant for the Intel Core Micro-Architecture, after NetBurst was an epic fail.




    Though does remind me:
    I did a check and if the superscalar fetch for RV and XG3 is able to
    swap Lane 1 and Lane 2 instructions in the case where a Lane 1 only op
    would have been in Lane 2 in a potential 2-wide fetch, this increases
    IPB (Instructions per Bundle) from 1.14 to 1.20.

    Though, that this happens indicates that likely there are either
    instruction scheduling mismatches between the compiler and CPU's models,
    or the compiler is settling on scheduling that does not maximize IPB in
    the more naive case.

    Granted, maybe this is weak, and maybe not enough to justify the 6R3W
    GPR file and (ALU-only) Lane 3.

    Though, indexed store is semi-common, and with a 4R2W regfile this
    becomes a scalar instruction.


    That said, if I decide to start on a JX3 core, may need to decide on a
    few things:
      4R2W, 6R2W, or 6R3W
        First 2 are if I drop to 2-wide superscalar.
      Whether to stay with my existing RingBus,
        or evaluate other possibilities.
      What exact stuff to prune.

    That said, the likely easiest route would be to start a JX3 core as a
    stripped down JX2 core and keep most of the same general infrastructure, rather than a ground-up rewrite.


    Though, I am likely to make a more significant redesign of the IF and ID stages.


    And maybe try to come up with some way that I can both do superscalar
    tagging during cache-line ingest while also resolving the cache line
    boundary issues (can't go superscalar across a cache-line if one can't
    see what is in the following instructions, and in this case the status
    of an instruction at the start of a cache-line would depend on the
    status of the instruction at the end of the preceding line).

    Though, another strategy might be to instead do the tagging post-ingest,
    but then have a "tagging miss" case where it figures out the tagging and writes it back into the tagging bits;
      Address Miss:
        Fetch cache lines from RAM;
      Address Hit, Tag Miss:
        Recalc Superscalar stuff;
        Write updated tag bits back into I$;
        At this stage, we can see both current and following cache lines.

    So:
    + Avoids the cache-line crossing limit;
    + Avoids the latency cost of doing all this during fetch.
    - Adds additional miss and write-back logic.
    - Would still be weak against the "first instruction in line" issue

    Though, another option could be to do it primarily at ingest time, and
    use post-ingest (or tag miss) specifically for fixing up the cache line crossing cases. Can address both issues, but is more convoluted.
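
    The "address hit, tag miss" flow described above can be sketched roughly
    as follows. This is only a toy Python model under stated assumptions:
    instructions are (dest, src1, src2) tuples, the pairing rule in
    can_pair() is made up for illustration, and the ICache class and its
    method names are hypothetical rather than anything from the JX2/JX3
    design.

```python
def can_pair(first, second):
    """Hypothetical rule: second may not read or overwrite first's dest."""
    return first[0] not in second


def compute_tags(line, next_line):
    """Tag instruction i if it can co-issue with instruction i+1.

    The last slot looks at the first instruction of the following line,
    which is why both lines must be visible when tags are computed.
    """
    following = line[1:] + next_line[:1]
    tags = []
    for i, first in enumerate(line):
        tags.append(i < len(following) and can_pair(first, following[i]))
    return tags


class ICache:
    def __init__(self, ram):
        self.ram = ram    # line index -> list of instruction tuples
        self.lines = {}   # cached instruction lines
        self.tags = {}    # cached pairing tags; absent until computed

    def fetch(self, index):
        """Return (events, instructions, tags) for one cache line."""
        events = []
        # Address miss: bring in the current and the following line.
        for i in (index, index + 1):
            if i not in self.lines:
                self.lines[i] = self.ram.get(i, [])
                events.append("address miss")
        # Address hit, tag miss: recalc tags and write them back.
        if index not in self.tags:
            self.tags[index] = compute_tags(self.lines[index],
                                            self.lines[index + 1])
            events.append("tag miss")
        return events, self.lines[index], self.tags[index]
```

    In this model the second fetch of the same line reports no events, since
    both the line and its tag bits are already present.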

    ...


    Terje


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu May 21 14:21:42 2026
    From Newsgroup: comp.arch

    On 5/21/2026 1:30 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making register
    tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to reshuffle
    instructions to pack into the pipeline more efficiently.

    We do not do instruction scheduling, we let the instruction queues do
    that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance than
    just leaving them in whatever order they come out of the main codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until the
    late 1990s (eg, Pentium II and Pentium III).



    As for me, in 1988 I would still have been very young. I think these
    being mostly the "sit around and watch cartoons" years (with my "K12"
    years mostly spanning the 1990s and early 2000s).

    But, I am getting on in years, having existed for over 4 decades now...


    --------------------
    non-executed instructions still affect the pipeline flow as-if they were
    executed; so things like register RAW dependencies and similar still
    I found this unnecessary


    Well, in my case it is a minor drawback:
    One doesn't know whether the instruction is EX or No-EX until the EX
    stages, but by this point the pipeline flow is essentially already
    locked-in (to some extent, it is already locked in by the IF stage, but
    IF isn't going to know yet whether or not these instructions will
    actually run).

    It's a simple problem for Reservation Stations to handle.


    Once again, CPU is in-order.



    --------------
    Your problem, yet you see it as an advantage ?!?!


    Well, not this part...

    This part is an annoyance, but no obvious way to make everything behave
    as if it costs 0 cycles (or, less cycles than the "normal case").

    Key word "obvious" but then again you spell both load and store as MOV.


    This is an assembler design choice...

    But, then again:
    MOV EAX, DWORD PTR [EDX] //Intel style
    MOV.L 0(%EDX), %EAX //AT&T / GAS style
    MOV.L 0(%A3), %D0 //M68K style
    MOV.L @R3, R2 //SuperH style
    MOV.L (R3), R2 //style I went with
    ...
    LW X10, 0(X13) //RISC-V style
    // style BGB inherited

    A lot of targets still use MOV...

    MOV implies that the data is unaltered, while LD implies the memory
    value is expanded to fill out the whole register, and ST implies
    the register value is chopped off to fit in the memory container.
    Expansion means sign or zero extend, chop means bits are ignored.


    Despite all of the drastic redesigns in other areas, I was more
    conservative of changes that meant needing to go back and rewrite all of
    my existing ASM.

    The change from @R3 to (R3) was mostly cosmetic, but assembler accepts both.


    XG3 was the biggest jump in this area, but this was primarily due to
    switching over to the RISC-V ABI, and deciding whether to switch over to
    the canonical RISC-V ASM style (but augmented for the new instructions
    in XG3), or stick to the former ASM style (which I had already ended up
    using for RISC-V in BGBCC due to inertia).


    Technically, the compiler will accept writing:
    MOV.L @R10+, R11
    In RISC-V mode, even if it is both unorthodox and will break apart into
    2 instructions.


    If there is really a strong need to revisit the ASM style, could maybe
    do so.


    But, alas, this is mostly along similar lines to trying to address
    things like the inconsistent use of defines for instruction mnemonics
    within the compiler.

    In many cases, the mnemonics for some of the multiply instructions are effectively backwards inside the compiler, owing to how I was using the multiply ops on SuperH.

    Various instructions have different macro-mnemonics from what I ended up eventually going with.

    Some ops, like TST/TSTN are backwards between 2R and 3R variants.
    TST (2R): Set SR.T if (Rm&Rn)==0
    TST (3R): Set result if (Rm&Rn)!=0
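
    To make the inverted sense concrete, here is a small Python sketch of the
    two forms as described above. It assumes "set" means producing 1 (and 0
    otherwise); the function names are hypothetical.

```python
def tst_2r(rm, rn):
    """2R form: returns the new SR.T bit, set when (Rm & Rn) == 0."""
    return int((rm & rn) == 0)


def tst_3r(rm, rn):
    """3R form: returns the result value, set when (Rm & Rn) != 0."""
    return int((rm & rn) != 0)
```

    For the same inputs the two forms always give opposite answers, which is
    the "backwards" behavior noted above.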

    Well, and XG3 currently lacks a way to encode MOVT and MOVNT (short of bringing back XG1/XG2 1R encodings, which I don't want to do).

    Can sorta fake MOVNT with CSELT and a jumbo prefix, which is a 64-bit encoding, but it isn't commonly used.


    Etc...



    The assembler in BGBCC also accepts RISC-V notation (and I almost
    considered switching over), but I mostly ended up sticking with the
    former style syntax due to inertia (and there is a "non-zero friction
    cost" related to which ASM syntax one uses).



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri May 22 14:37:51 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 21 May 2026 16:05:10 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    BGB wrote:
    On 5/20/2026 5:51 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/20/2026 1:25 PM, MitchAlsup wrote:
    ----------------------------
    The HW does not switch back and forth between clauses, making
    register tracking in GBOoO fairly easy.


    In my case, the predication scheme allows the compiler to
    reshuffle instructions to pack into the pipeline more
    efficiently.

    We do not do instruction scheduling, we let the instruction queues
    do that kind of stuff. You must be caught in 1988 ...

    I am assuming in-order here.

    Shuffling instructions in the compiler leads to higher performance
    than just leaving them in whatever order they come out of the main
    codegen.


    Then again, AFAIK OoO didn't really hit mainstream processors until
    the late 1990s (eg, Pentium II and Pentium III).

    Please check your history!

    The PentiumPro was the first ever mass-market OoO CPU, it arrived
    around 1996.

    PentiumI/III/MMX were just variation on the original Pentium which
    introduced superscalar in the form of the u and v pipes which could
    execute two instructions at once,

    Pentium-MMX (P55C) is indeed a variant of Pentium, with not insignificant
    microarchitectural changes (1 stage longer pipeline, slightly relaxed
    pairing rules, different decoder that does not depend on instruction
    boundary marks in the cache).
    But Pentium III is P6, next generation of Pentium II with almost
    identical core uArch.

    OK, Mea Culpa! I have obviously started to forget all the various names
    that Intel applied over the years. :-(


    IFF you aligned them properly and
    selected a simple instruction for the v pipe.

    P6 had its own problems, not with execution phase but with decoders.
    Google for 4-1-1.

    The 4-1-1 rule simply said that you could only have a single "big"
    opcode in a decoding bundle, and it had to be the first, otherwise
    decoding would halt at that point and restart in the next cycle.
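
    As a rough sketch of the rule as stated above, the following Python
    function counts decode cycles for a stream of instructions given as
    micro-op counts. It assumes three decode slots where only the first can
    take an instruction of more than one micro-op, and it does not model
    instructions longer than 4 micro-ops or any other decoder limits; the
    function name is hypothetical.

```python
def decode_cycles_411(uop_counts):
    """Count decode cycles for a list of per-instruction micro-op counts.

    Slot 0 accepts any instruction (assumed to be at most 4 micro-ops).
    Slots 1 and 2 accept only single-micro-op instructions; a "big"
    instruction arriving there ends the bundle and waits for the next cycle.
    """
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        i += 1                      # slot 0 takes the next instruction
        for _ in range(2):          # slots 1 and 2
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break
    return cycles
```

    Under these assumptions, six simple instructions decode in 2 cycles,
    while the same six instructions with every second one "big"
    ([1, 2, 1, 2, 1, 2]) take 4 cycles.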

    I cannot remember ever observing this to be an issue for my own asm
    code, but I tend to write simple load/store style code due to previous
    Pentium efforts where that could make a big difference.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri May 22 14:44:38 2026
    From Newsgroup: comp.arch

    BGB wrote:
    Either way, basic point still stands:
      1996 is still much later than 1988...


    Though, looking stuff up:
      Seems PentiumPro was intended for the workstation market;
        Its market share was (AFAIK) not as big as the Pentium II or III.
      Whereas Pentium II was a more consumer marketed chip,
        and widely sold.

    Stuff I am reading says that PentiumPro, Pentium II, and Pentium III
    were all based on the P6 architecture.

    You were right, Michael S already showed me that I was wrong re
    PentiumII/III.

    In contrast to the Pentium I and Pentium MMX, which were P5.
      Or Pentium 4, which was NetBurst.

    Then Intel later brought back a P6 variant for the Intel Core Micro-Architecture, after NetBurst was an epic fail.

    The P4 was an amazing design, running the core at 2 X the official clock
    speed but only doing half an operation per half-cycle.

    It ran like the proverbial bat out of hell when everything aligned
    properly, but slammed into a wall every time it had to leave the fast
    inner core and switch to normal processing, e.g. for stuff like SHR.

    Afair, it also blew up integer MUL at the same time as shifts became
    much slower, which meant that many previously cheap addressing
    operations now became much slower.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri May 22 09:27:22 2026
    From Newsgroup: comp.arch

    On 2026-May-22 08:44, Terje Mathisen wrote:
    BGB wrote:
    Either way, basic point still stands:
       1996 is still much later than 1988...


    Though, looking stuff up:
       Seems PentiumPro was intended for the workstation market;
         Its market share was (AFAIK) not as big as the Pentium II or III.
       Whereas Pentium II was a more consumer marketed chip,
         and widely sold.

    Stuff I am reading says that PentiumPro, Pentium II, and Pentium III were all based on the P6 architecture.

    You were right, Michael S already showed me that I was wrong re PentiumII/III.

    In contrast to the Pentium I and Pentium MMX, which were P5.
       Or Pentium 4, which was NetBurst.

    Then Intel later brought back a P6 variant for the Intel Core Micro-Architecture, after NetBurst was an epic fail.

    The P4 was an amazing design, running the core at 2 X the official clock speed but only doing half an operation per half-cycle.

    It ran like the proverbial bat out of hell when everything aligned properly, but slammed into a wall every time it had to leave the fast inner core and switch to normal processing, e.g. for stuff like SHR.

    Afair, it also blew up integer MUL at the same time as shifts became much slower, which meant that many previously cheap addressing operations now became much slower.

    Terje


    It sounds like you are referring to what some called a "replay storm"
    or "replay cyclone".

    Replay: Unknown Features of the NetBurst Core
    https://web.archive.org/web/20160306140603/http://www.xbitlabs.com/articles/cpu/print/replay.html



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 22 17:46:15 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    BGB wrote:
    Either way, basic point still stands:
      1996 is still much later than 1988...


    Though, looking stuff up:
      Seems PentiumPro was intended for the workstation market;
        Its market share was (AFAIK) not as big as the Pentium II or III.
      Whereas Pentium II was a more consumer marketed chip,
        and widely sold.

    Stuff I am reading says that PentiumPro, Pentium II, and Pentium III
    were all based on the P6 architecture.

    You were right, Michael S already showed me that I was wrong re PentiumII/III.

    In contrast to the Pentium I and Pentium MMX, which were P5.
      Or Pentium 4, which was NetBurst.

    Then Intel later brought back a P6 variant for the Intel Core Micro-Architecture, after NetBurst was an epic fail.

    The P4 was an amazing design, running the core at 2 X the official clock speed but only doing half an operation per half-cycle.

    At close to 2.5× the power.

    It ran like the proverbial bat out of hell when everything aligned
    properly, but slammed into a wall every time it had to leave the fast
    inner core and switch to normal processing, e.g. for stuff like SHR.

    Afair, it also blew up integer MUL at the same time as shifts became
    much slower, which meant that many previously cheap addressing
    operations now became much slower.

    Terje

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 09:49:10 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The P4 was an amazing design, running the core at 2 X the official clock
    speed but only doing half an operation per half-cycle.

    That was only the Willamette and Northwood. I tested it, and it could
    really do two dependent additions per cycle. The Prescott and Cedar
    Mill variants, however, had a single-cycle adder.

    It ran like the proverbial bat out of hell when everything aligned
    properly, but slammed into a wall every time it had to leave the fast
    inner core and switch to normal processing, e.g. for stuff like SHR.

    Afair, it also blew up integer MUL at the same time as shifts became
    much slower, which meant that many previously cheap addressing
    operations now became much slower.

    The Willamette (and Northwood) performs multiplies and shifts in the
    FPU, and as somebody here explained at the time, it crosses several
    clock domains in the transfer, losing a cycle on every crossing in
    each direction.

    However, the not-so-great IPC is reportedly mainly thanks to the
    replay storms that EricP mentions. For various cases where an
    instruction was issued before its operands were ready (e.g., due to a
    cache miss, or because the functional unit that should compute the
    operand was occupied), these instructions were replayed. From what I
    read, their simulator did not model the replays and they assumed that
    they would play a minor role, but apparently replays caused additional
    replays, resulting in replay storms. But I have not found a good
    explanation on what happens there.

    Anyway, if you read <https://chipsandcheese.com/p/intels-netburst-failure-is-a-foundation-for-success>,
    it points out that the Pentium 4 pioneered many of the techniques that
    were implemented in Sandy Bridge (2011), but they managed to avoid the
    pitfalls of the Pentium 4 in Sandy Bridge, and the Sandy Bridge was
    very successful.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 17:44:08 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Replay: Unknown Features of the NetBurst Core
    https://web.archive.org/web/20160306140603/http://www.xbitlabs.com/articles/cpu/print/replay.html

    Yes, xbitlabs was a good site. Apparently there is not enough
    interest in these topics to support such sites when they try to do it professionally.

    Anyway, yes, that's a detailed description of replays, and now I
    remember that I have already seen this (certainly I remember seeing
    something about RL-7 and RL-12). Unfortunately, in a number of the
    more involved places it is not clear enough (or maybe I invested not
    enough time into understanding it), so I only took away a general
    impression.

    In any case, one thing I wonder about is the microbenchmark that
    results in Pic. 1: With just dependent loads (and no adds), it already
    has 19 cycles of latency instead of the 9 cycles of latency to L2.
    How come? Ok, the second load takes one round through the replay
    loop, and in a long line of loads, several loads will travel through
    RL-7 at all times. And I guess, some will cause a delay of a load
    entering the dispatch, but that should add one cycle of latency now
    and then. How do we get 19? The increase to 32 cycles if one add is
    involved also seems extreme. Somehow I still have not figured out
    some important part of these replay loops.

    Then again, with 9 cycles of L2 latency and 40 0.5-cycle adds in one
    dependence chain, I would expect a minimum latency of 27 cycles, but
    Pic.1 shows <22 cycles. How come?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat May 23 18:16:56 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 16:05:10 +0200, Terje Mathisen wrote:
    BGB wrote:

    Then again, AFAIK OoO didn't really hit mainstream processors until the
    late 1990s (eg, Pentium II and Pentium III).

    Please check your history!

    The PentiumPro was the first ever mass-market OoO CPU, it arrived around 1996.

    You're both right.

    The Pentium II, of course, was basically a minor redesign of the Pentium
    Pro, intended to improve performance on 16-bit code, and lower
    manufacturing costs by going to a half-speed cache that didn't require an elaborate direct connection to the main die.

    However, the Pentium II was a "mainstream processor", while the Pentium
    Pro was expensive, exotic, and really only sold to commercial customers.
    The Pentium II, on the other hand, was in ordinary PCs sold to home users.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 19:58:25 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Replay: Unknown Features of the NetBurst Core
    https://web.archive.org/web/20160306140603/http://www.xbitlabs.com/articles/cpu/print/replay.html

    Yes, xbitlabs was a good site. Apparently there is not enough
    interest in these topics to support such sites when they try to do it professionally.

    Anyway, yes, that's a detailed description of replays, and now I
    remember that I have already seen this (certainly I remember seeing
    something about RL-7 and RL-12). Unfortunately, in a number of the
    more involved places it is not clear enough (or maybe I invested not
    enough time into understanding it), so I only took away a general
    impression.

    In any case, one thing I wonder about is the microbenchmark that
    results in Pic. 1: With just dependent loads (and no adds), it already
    has 19 cycles of latency instead of the 9 cycles of latency to L2.
    How come?

    Memory order without some kind of memory order prediction.
    When one LD takes a miss, all younger LDs have to replay;
    sometimes, backing DECODE back up to the point of insertion
    into Station.

    Ok, the second load takes one round through the replay
    loop, and in a long line of loads, several loads will travel through
    RL-7 at all times. And I guess, some will cause a delay of a load
    entering the dispatch, but that should add one cycle of latency now
    and then. How do we get 19? The increase to 32 cycles if one add is involved also seems extreme. Somehow I still have not figured out
    some important part of these replay loops.

    Then again, with 9 cycles of L2 latency and 40 0.5-cycle adds in one dependence chain, I would expect a minimum latency of 27 cycles, but
    Pic.1 shows <22 cycles. How come?

    - anton
    --- Synchronet 3.22a-Linux NewsLink 1.2