• Re: Misc: BGBCC targeting RV64G, initial results...

    From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Oct 16 11:59:57 2024
    From Newsgroup: comp.arch

    On 9/30/24 1:52 AM, MitchAlsup1 wrote:
    On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:
    [snip]
    Another weird concept that came to mind would be providing an
    8-bit (e.g.) field that enumerated a set of interesting
    conditions.

    I use a 64-bit container of conditions

    An enumeration of conditions is different from a bitmask of
    conditions. An enumeration could support N-way branching in a
    single instruction rather than a tree of single bit-condition
    branches.
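
A minimal sketch of the distinction, written in Python rather than in hardware terms. The names `Cond`, `compare_enum`, `compare_bits`, `dispatch_enum` and `dispatch_bits` are hypothetical and do not come from My 66000 or any other ISA in this thread. The only point is that an enumerated result can index a target table directly, while a bitmask needs one test per bit.

```python
from enum import IntEnum


class Cond(IntEnum):
    # Hypothetical enumeration: one value per outcome, not one bit per condition.
    LT = 0
    EQ = 1
    GT = 2


def compare_enum(a, b):
    """Return a single enumerated condition for a versus b."""
    if a < b:
        return Cond.LT
    if a == b:
        return Cond.EQ
    return Cond.GT


def dispatch_enum(a, b, targets):
    """N-way branch: the enumeration indexes the target table in one step."""
    return targets[compare_enum(a, b)]


def compare_bits(a, b):
    """Bitmask form (hypothetical layout): bit 0 = less-than, bit 1 = equal."""
    return (1 if a < b else 0) | (2 if a == b else 0)


def dispatch_bits(a, b, targets):
    """Same result reached through a chain of single-bit tests."""
    mask = compare_bits(a, b)
    if mask & 1:
        return targets[Cond.LT]
    if mask & 2:
        return targets[Cond.EQ]
    return targets[Cond.GT]
```

Both dispatchers pick the same target; the enumerated form simply replaces the chain of tests with one table lookup, which is also where the density concern comes from when several states share a target.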

    My 66000's compare result has unused space for multiple such
    enumerations.

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch. Relation to zero as well
    as an explicit comparison value might be useful for some multi-way
    choices.

    Lack of density is also a problem for multi-way branches; the
    encoding will waste space if multiple enumerated states share a
    target.

    The concept seemed worth mentioning even if I thought it unlikely
    to be practically useful.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 16 11:07:08 2024
    From Newsgroup: comp.arch

    On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

    snip

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch.

    That was the function of the arithmetic if statement in original
    Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 19:11:02 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 15:59:57 +0000, Paul A. Clayton wrote:

    On 9/30/24 1:52 AM, MitchAlsup1 wrote:
    On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:
    [snip]
    Another weird concept that came to mind would be providing an
    8-bit (e.g.) field that enumerated a set of interesting
    conditions.

    I use a 64-bit container of conditions

    An enumeration of conditions is different from a bitmask of
    conditions. An enumeration could support N-way branching in a
    single instruction rather than a tree of single bit-condition
    branches.

    My 66000's compare result has unused space for multiple such
    enumerations.

    One can "do" 3-way branching as is:: CMP-BC1-BC2-other

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch. Relation to zero as well
    as an explicit comparison value might be useful for some multi-way
    choices.

    3-way branches are out of style:: Fortran disinherited them
    while IEEE 754 made them need to be 4-way (NaN).

    Lack of density is also a problem for multi-way branches; the
    encoding will waste space if multiple enumerated states share a
    target.

    The concept seemed worth mentioning even if I thought it unlikely
    to be practically useful.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 16 19:17:17 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

    snip

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch.

    That was the function of the arithmetic if statement in original
    Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.

    Not so long ago, actually; it was only dropped in Fortran 2018.
    I actually think that this is a bad idea: compilers will continue
    to support such features, but possible interactions with other
    features will no longer be properly defined.
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 16 15:23:08 2024
    From Newsgroup: comp.arch

    On 10/16/2024 1:07 PM, Stephen Fuld wrote:
    On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

    snip

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch.

    That was the function of the arithmetic if statement in original
    Fortran.  If it were more useful, it wouldn't have been taken out of the language long ago.


    Yeah...

    Ironically, one of the main arguable use-cases for old Fortran style IF statements is implementing the binary dispatch logic in a binary
    subdivided "switch()", but not enough to justify having a dedicated instruction for it.

    Say:
    MOV Imm, Rt //pivot case
    BLT Rt, Rx, .lbl_lo
    BGT Rt, Rx, .lbl_hi
    BRA .lbl_case

    But, absent having multiple labels per branch, not really a good way to
    save much over this...
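
A rough sketch of the binary-subdivided dispatch described above, assuming a sorted list of case values and one three-way compare per pivot. The helper names `build_dispatch` and `dispatch` are hypothetical, and this models the control flow only, not BGBCC's actual code generation.

```python
def build_dispatch(cases):
    """Build a balanced binary decision tree from (value, label) pairs."""
    cases = sorted(cases)

    def build(lo, hi):
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        value, label = cases[mid]
        return (value, label, build(lo, mid), build(mid + 1, hi))

    return build(0, len(cases))


def dispatch(tree, x, default="default"):
    """Walk the tree; each node is one three-way compare against its pivot."""
    steps = 0
    node = tree
    while node is not None:
        pivot, label, lo, hi = node
        steps += 1
        if x < pivot:        # plays the role of the branch to .lbl_lo
            node = lo
        elif x > pivot:      # plays the role of the branch to .lbl_hi
            node = hi
        else:                # neither taken: branch to .lbl_case
            return label, steps
    return default, steps
```

With seven cases the walk needs at most three pivots, which is the depth a three-way branch per node would give as well; the sequence above spends four instructions per node to get there.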



    Otherwise, had recently been still working on BGBCC+RV stuff:
    Trying to get stuff working correctly in my Verilog implementation.
    There are still some bugs here.

    Writing a spec for a "low-cost" FPU SIMD extension:
    https://pastebin.com/9UeAP9Yk

    Which basically just takes the arguably cheaper route of "extend the F,
    D, and Zfh extensions to support basic FPU-SIMD in the existing FPRs"
    rather than "define a whole new complicated mess of stuff" that is the V extension.

    Some details are still in-flux, and I have not yet decided whether or
    not to map over the FP8 converter ops and similar. Arguably FP8 and
    A-Law converter ops are a bit niche though.


    As well as looking some at the P spec, which (ignoring the needlessly complicated parts) isn't too far from what BJX2 does SIMD wise (albeit
    lacks obvious direct equivalents of the RGB555 helper instructions; but possibly using SIMD to work with RGB555 pixel data is a bit niche).


    It is possible if I add some of this, I may do it as jumbo-prefix-only
    ops. One is unlikely to see RGB555 or FP8 converters used in any
    significant density (except maybe if doing highly-unrolled NN code using
    FP8 or similar; but unclear if it would try to make sense to map this
    over to RV anyways; and existing people trying to do stuff in this area
    appear to be mostly focused on the V extension).

    For normal graphical or audio processing, having these sorts of niche converters as 64-bit encodings would probably be fine.


    As-is, it could do a 4x32 shuffle in 2 instructions, but would need
    either a 4-op sequence (no jumbo), or a jumbo-encoded op, to perform a
    4x16 shuffle (it is either this or define a dedicated "FPSHUF.H"
    instruction or similar). Can probably assume, if it matters, will
    probably also have a jumbo prefix.

    May still need to decide on some other things, like whether to map over
    a jumbo-encoded 4xFP8 to 4xFP16 constant-load. Or, whether to come up
    with an encoding to load an arbitrary 64-bit value into an FPR
    (currently N/E in RV64 mode).

    As-is:
    J22+J22+LUI : LI Xn, Imm64
    J22+J22+AUIPC: Unused
    J22+J22+JAL : Unused, Possible "JAL Rn, Abs64"


    For FPRs, it may make sense to have:
    Load Binary16, expanding to Binary64 (already in Jumbo spec)
    Load Imm33s into low-order bits (Jumbo spec, J12O+LUI)
    Load Imm32 into high-order bits
    Possible, not yet defined, already exists in BJX2 (1).
    Load Imm32 as 2xFP16 expanding to 2xFP32
    Possible, not yet defined, already exists in BJX2 (1).
    Load Imm32 as 4xFP8 expanding to 4xFP16
    Possible, not yet defined, already exists in BJX2 (1).

    *1: Probably could define it as J12O+LUI, using the Wm and Wo register-extension bits to encode which type of constant to load
    (basically about the same as how I did it in BJX2; just it had used a
    J_OP and "MOV Imm16u, Rn" instruction instead, but similar basic idea here).
    Probably, say:
    00: Load Imm33s to low 32-bits, sign-extend as usual
    01: Load Imm32 to high 32-bits (sign bit used for LSB fill, *2)
    10: 2xFP16 -> 2xFP32
    11: 4xFP8 -> 4xFP16
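
A loose sketch of how the two-bit selector above might steer a 33-bit immediate into a 64-bit FPR image. Several things here are assumptions and not part of any spec: the function name `load_fpr_const`, the reading of mode 01 as "Imm32 in bits 63:32 with the immediate's sign bit replicated into bits 31:0", and the lane order in mode 10 (low half-word maps to the low single). Mode 11 is left unimplemented because the FP8 layout is not given in this post.

```python
import struct


def load_fpr_const(mode, imm33):
    """Return the 64-bit register image for a 33-bit immediate (sketch only)."""
    imm33 &= (1 << 33) - 1
    sign = imm33 >> 32
    imm32 = imm33 & 0xFFFFFFFF
    if mode == 0b00:
        # Sign-extend the 33-bit immediate to 64 bits.
        return (imm33 - (sign << 33)) & 0xFFFFFFFFFFFFFFFF
    if mode == 0b01:
        # Assumption: Imm32 goes to the high word, sign bit fills the low word.
        fill = 0xFFFFFFFF if sign else 0
        return (imm32 << 32) | fill
    if mode == 0b10:
        # Widen two packed Binary16 lanes to two packed Binary32 lanes.
        lanes = struct.unpack("<2e", struct.pack("<I", imm32))
        return struct.unpack("<Q", struct.pack("<2f", *lanes))[0]
    # Mode 0b11 (4xFP8 -> 4xFP16) needs an FP8 format that the post does not define.
    raise NotImplementedError("FP8 layout not specified")
```

For example, packing half-precision 1.0 (0x3C00) and -2.0 (0xC000) into one Imm32 and loading it with mode 10 yields the two single-precision patterns 0x3F800000 and 0xC0000000 side by side.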

    *2: Though in BJX2, this case was encoded as J_IMM+"LDIHI Imm10, Rn"

    Could maybe be tempted to reclaim "J22+J22+AUIPC" as:
    LI Fn, Imm64
    Arguing that, if one needs PC-rel, +/- 4GB is sufficient; and one is far
    more likely to want to be able to load constants like M_PI and similar
    into an FPU register (in a single clock-cycle).

    Though, if one has this, the other constant cases (2xFP16 or 4xFP8)
    would be merely space-saving (mostly relevant to FP-SIMD vector
    literals), but may be lower priority mostly as they are infrequently
    used (and thus the space savings are less significant).

    Relative cost-difference is small, if one assumes an implementation
    where the constant-load cases use the same converters as used for the
    normal vector conversion path, which would be (presumably) already present.


    Most of this would be largely irrelevant to Doom performance, but would
    be relevant if I want to try to make GLQuake work at some semblance of
    usable in RV Mode.

    Less immediate relevance to SW Quake, which uses mostly scalar FPU (and
    mostly naively represents vectors as in-memory pointers).



    In this stuff, I have also started running into the annoyance of tracking differences (additions, removals, and changes) between different versions of
    the BitManip spec / B extension. A few useful ops were removed in newer versions, ...

    My Jumbo prefix encoding would have conflicted with an earlier version
    of BitManip, but does not conflict with the current form of the B
    extension (it exists in the shadow of previously-removed instructions).


    Felt curious and looked, it looks like the person mostly responsible for
    the B extension has largely "gone quiet" for the past year or so (no
    recent social media posts, has seemingly taken down all of their past
    YouTube and Twitch contents; minimal activity on GitHub). Not entirely
    sure what is going on there.

    ...


    Otherwise, did see a video talking some about performance of Doom and
    Quake and similar on older systems:
    Doom apparently required something like a 486 DX2-66 to perform well.
    Quake apparently required a faster Pentium system to be playable.
    Apparently, likewise for Hexen, ...
    Apparently Wolf3D needed a higher-end 386 to perform well.
    Even if it could technically run on a 286.
    ...

    I guess this differs from my prior understanding that Doom would have
    been mostly playable on a 25 MHz 386 or similar. Apparently, not really.


    So, I guess I can feel not quite as bad about the lackluster framerates
    from Quake and Hexen on a 50MHz core. Seemingly, it is in fact still outperforming vintage (early 90s) PCs.

    Well, and Quake3 is pretty slow, but IIRC, PCs of that era were
    generally pushing 1GHz, so...


    ...

  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 22:16:29 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:


    Ironically, one of the main arguable use-cases for old Fortran style IF statements is implementing the binary dispatch logic in a binary
    subdivided "switch()", but not enough to justify having a dedicated instruction for it.

    Say:
    MOV Imm, Rt //pivot case
    BLT Rt, Rx, .lbl_lo
    BGT Rt, Rx, .lbl_hi
    BRA .lbl_case

    With a 64-bit instruction one could do::

    B3W .lbl_lo,.lbl_zero,.lbl_hi

    rather straightforwardly.....
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 16 19:03:26 2024
    From Newsgroup: comp.arch

    On 10/16/2024 5:16 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:


    Ironically, one of the main arguable use-cases for old Fortran style IF
    statements is implementing the binary dispatch logic in a binary
    subdivided "switch()", but not enough to justify having a dedicated
    instruction for it.

    Say:
       MOV  Imm, Rt  //pivot case
       BLT  Rt, Rx, .lbl_lo
       BGT  Rt, Rx, .lbl_hi
       BRA  .lbl_case

    With a 64-bitinstruction one could do::

        B3W   .lbl_lo,.lbl_zero,.lbl_hi

    rather straightforwardly.....

    Possibly, but the harder part would be to deal with decoding and feeding
    the instruction through the pipeline.

    Granted, I guess it could be decoded as if it were a normal 3RI op or
    similar, but then split up the immediate into multiple parts in EX1.


    Say:
    Decode as a 3RI Imm33s;
    Then split the immediate into 3x 11-bits, calculate 3 offsets relative
    to PC, and apply the one which matches the result of the comparison
    (likely needing to route the S and Z flags from the subtract logic to
    EX1 or similar; vs the current logic routing the CMP T/F flag).
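
A small sketch of the "split the Imm33s into 3x 11 bits" idea, assuming a particular field order (low target in bits 10:0, equal target in bits 21:11, high target in bits 32:22) and a byte scale of 4 per displacement unit. Neither detail is stated above, and `sign_extend`, `pack_disp3` and `branch3` are hypothetical names. The S and Z flags are modelled with unbounded Python integers, so overflow handling is ignored.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def pack_disp3(lo, eq, hi):
    """Pack three signed 11-bit displacements into one 33-bit immediate."""
    for d in (lo, eq, hi):
        if not -1024 <= d <= 1023:
            raise ValueError("displacement does not fit in 11 bits")
    return (lo & 0x7FF) | ((eq & 0x7FF) << 11) | ((hi & 0x7FF) << 22)


def branch3(pc, imm33, rn, rm, scale=4):
    """Select the displacement that matches the compare result and apply it."""
    diff = rn - rm
    s_flag = diff < 0    # sign flag from the subtract
    z_flag = diff == 0   # zero flag from the subtract
    if z_flag:
        field = imm33 >> 11
    elif s_flag:
        field = imm33
    else:
        field = imm33 >> 22
    return pc + scale * sign_extend(field, 11)
```

With 11-bit fields and a scale of 4, each target can sit at most about 4K bytes from the branch, which is the range limit noted further down for larger switch blocks.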

    Could deal with the Branch PC as, say:
    Calculate PC[47:16]+1, and PC[47:16]-1.
    Calculate the low 16 bits of each branch direction;
    Select direction based on branch result;
    Select high bits of PC based on selected branch direction (-1, 0, 1).
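
The high-bits selection above can be sketched as follows, assuming a 48-bit PC and a byte displacement that fits in 16 signed bits. The name `branch_target_split` is hypothetical. The low 16 bits go through a 16-bit add, and its carry together with the displacement sign picks among PC[47:16]-1, PC[47:16], and PC[47:16]+1.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def branch_target_split(pc, disp):
    """Compute (pc + disp) mod 2**48 without a full-width adder (sketch)."""
    if not -0x8000 <= disp <= 0x7FFF:
        raise ValueError("displacement does not fit in 16 signed bits")
    hi = (pc >> 16) & MASK32
    hi_plus = (hi + 1) & MASK32      # PC[47:16] + 1, computed ahead of time
    hi_minus = (hi - 1) & MASK32     # PC[47:16] - 1, computed ahead of time
    low_sum = (pc & MASK16) + (disp & MASK16)   # 17-bit result
    carry = low_sum >> 16
    negative = disp < 0
    if negative and not carry:
        hi_sel = hi_minus            # backward branch crossed a 64K boundary
    elif carry and not negative:
        hi_sel = hi_plus             # forward branch crossed a 64K boundary
    else:
        hi_sel = hi
    return (hi_sel << 16) | (low_sum & MASK16)
```

In a three-way version, each of the candidate displacements would get its own low-half add, with the compare result selecting one low half and its matching high-half choice.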

    But, worth the cost?...
    This could mostly benefit programs that spend a significant part of
    their running time dispatching in sparse switch blocks, but probably not
    a lot else.


    Disp11 couldn't deal with particularly large switch blocks, one might
    need a 96 bit encoding, possibly using 18 bits each, but this would be
    more expensive to deal with.


    Or, 2-way with fall-through:
    Rn>Rm: Branch High
    Rn<Rm: Branch Low
    Rm==Rn: Fall Through / No Branch
    The fall-through case having a branch to the case label. This would
    allow 16 (2-way) and 20/23 bit displacements (for a plain JAL/BRA), so
    could deal with much bigger "switch()" blocks.


    Would still need to think on this...


  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Oct 18 02:28:45 2024
    From Newsgroup: comp.arch

    On Thu, 17 Oct 2024 0:03:26 +0000, BGB wrote:

    On 10/16/2024 5:16 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:


    Ironically, one of the main arguable use-cases for old Fortran style IF
    statements is implementing the binary dispatch logic in a binary
    subdivided "switch()", but not enough to justify having a dedicated
    instruction for it.

    Say:
       MOV  Imm, Rt  //pivot case
       BLT  Rt, Rx, .lbl_lo
       BGT  Rt, Rx, .lbl_hi
       BRA  .lbl_case

    With a 64-bitinstruction one could do::

        B3W   .lbl_lo,.lbl_zero,.lbl_hi

    rather straightforwardly.....

    Possibly, but the harder part would be to deal with decoding and feeding
    the instruction through the pipeline.

    Feed the 3×15-bit displacements to the branch unit. When the condition resolves, use one of the 2 selected displacements as the target address.

    Granted, I guess it could be decoded as if it were a normal 3RI op or similar, but then split up the immediate into multiple parts in EX1.

    Why would you want to make it 3×11-bit displacements when you can
    make it 3×16-bit displacements?

        +------+-----+-----+----------------+
        | Bc   |  3W |  Rt |   .lb_lo       |
        +------+-----+-----+----------------+
        |   .lb_zero       |  .lb_hi        |
        +------------------+----------------+
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 21 15:13:20 2024
    From Newsgroup: comp.arch

    On 10/17/2024 9:28 PM, MitchAlsup1 wrote:
    On Thu, 17 Oct 2024 0:03:26 +0000, BGB wrote:

    On 10/16/2024 5:16 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:


    Ironically, one of the main arguable use-cases for old Fortran style IF
    statements is implementing the binary dispatch logic in a binary
    subdivided "switch()", but not enough to justify having a dedicated
    instruction for it.

    Say:
       MOV  Imm, Rt  //pivot case
       BLT  Rt, Rx, .lbl_lo
       BGT  Rt, Rx, .lbl_hi
       BRA  .lbl_case

    With a 64-bitinstruction one could do::

         B3W   .lbl_lo,.lbl_zero,.lbl_hi

    rather straightforwardly.....

    Possibly, but the harder part would be to deal with decoding and feeding
    the instruction through the pipeline.

    Feed the 3×15-bit displacements to the branch unit. When the condition resolves, use one of the 2 selected displacements as the target address.


    No dedicated "branch unit" in my case.

    Generally, non-predicted branching is handled by using the AGU to
    generate the address, as in a memory load, but then signaling that a
    branch should be initiated (in the EX1 stage's glue logic).

    Generating a 3-way branch does not map to the AGU though.


    One downside of such a branch is that it would also not mix with my
    existing branch predictor logic, which thus far is built around a state machine of taken vs non-taken, so would likely ignore a 3-way branch
    (making it potentially slower than multiple conventional branches).



    Granted, I guess it could be decoded as if it were a normal 3RI op or
    similar, but then split up the immediate into multiple parts in EX1.

    Why would you want do make it 3×11-bit displacements when you can
    make it 3×16-bit displacements.

        +------+-----+-----+----------------+
        | Bc   |  3W |  Rt |   .lb_lo       |
        +------+-----+-----+----------------+
        |   .lb_zero       |  .lb_hi        |
        +------------------+----------------+

    Neither BJX2 nor RISC-V have the encoding space to pull this off...
    Even in a clean-slate ISA, it would be a big ask.


    Could be possible though in both, via a 96 bit encoding.

    Likely, a 2-way with fall-through on equal might make more sense:
    Cheaper to implement;
    If it falls through, one has already found the target case.


    But, yeah, 3x 11b isn't super useful, 2x 16b could be more useful.
    But, still wouldn't play with the branch-predictor.


    FWIW: Actually I went with the current jumbo prefix encoding rather than
    the official 64-bit instruction encoding scheme for my RV64 ext because, ironically, the route I went would eat less of the encoding space.


    Working more on BGBCC's RV64 support, I have recently ended up adding a
    mode to mimic native RISC-V ASM syntax. Ended up mostly relying on
    mnemonics to try to detect whether to use "Rd, Rs1, Rs2" vs "Rs1, Rs2,
    Rd" ordering.

    Some things are a little wonky in the assembler. As the way BGBCC had
    been doing things and the way RV ASM specifies things doesn't always
    match up strictly 1:1.


    Ended up using mnemonics:
    First thing on the line, so easy to parse;
    One of the biggest points of divergence between native RV and what BGBCC
    had been using (there wasn't really enough syntactic differences to rely
    on this to tell them apart).

    The assembler basically counts them up, and whichever side has more
    votes for it wins in terms of operand ordering.
    Say:
    LD X10, 16(X2) //will vote for Rd first ordering
    MOV.Q (SP, 16), R10 //will vote for Rd last ordering.
    LI X11, 1234 //will vote for Rd first ordering
    MOV 1234, R11 //will vote for Rd last ordering.
    MV X12, X10 //will vote for Rd first ordering
    MOV R10, R12 //will vote for Rd last ordering.
    ...

    Names that are shared in both styles have no vote either way.

    Stuff will not necessarily work as intended if one goes mix-and-match
    with the ASM styles (it is determined per ASM blob, not per line).

    Potentially, one could have ASM blobs too simple to be unambiguous, though:
    RET and JALR vote for Rd first;
    RTS and JMP vote for Rd last.
    So, theoretically, even the simplest inline ASM function should be
    unambiguous (and one isn't going to use inline ASM just to specify a
    single ADD instruction or similar...).
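
The voting scheme can be sketched roughly as below. The mnemonic tables hold only the examples named in this post, and `RD_FIRST`, `RD_LAST` and `detect_operand_order` are hypothetical names; BGBCC's real tables and parsing are certainly more involved.

```python
# Hypothetical tables: only the mnemonics mentioned in the post are listed.
RD_FIRST = {"LD", "LI", "MV", "RET", "JALR"}
RD_LAST = {"MOV.Q", "MOV", "RTS", "JMP"}


def detect_operand_order(asm_blob):
    """Return 'rd_first', 'rd_last', or 'ambiguous' for one ASM blob."""
    first = last = 0
    for line in asm_blob.splitlines():
        code = line.split("//", 1)[0].strip()   # drop trailing comments
        if not code or code.endswith(":"):      # skip blank lines and labels
            continue
        mnemonic = code.split()[0].upper()
        if mnemonic in RD_FIRST:
            first += 1
        elif mnemonic in RD_LAST:
            last += 1
        # Mnemonics shared by both styles (e.g. ADD) cast no vote.
    if first > last:
        return "rd_first"
    if last > first:
        return "rd_last"
    return "ambiguous"
```

The decision is made once per blob, which matches the note above that mixing the two styles inside one blob will not work as intended.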

    For LW and SW, both are parsed as-if they were loads, but SW and similar
    have gotten new ID numbers, so if one tries to do a Load with one of the
    Store IDs, it bounces it over to the Store path in the instruction
    emitter logic. This is a little wonky, but alas (was either this or add
    wonky special case logic in the ASM parsing).

    The main alternative would have been to add assembler directives to
    indicate the operand ordering more explicitly (at which point one could
    go mix-and-match with the ASM styles if they wanted, provided directives
    were used).

    Some operand lists are only valid in certain modes though:
    ADD Rs, Imm, Rn //only valid if Rd last
    ADD Rn, Rs, Imm //only valid if Rd first
    Though, these cases don't count in the vote as they would have required
    more involved parsing. These could be used as "keys" though, as ASM
    parsing would fail (resulting in a compiler error) if in the wrong mode.


    Note that in the ASM parsing "(R4, 16)" and "16(R4)" are considered functionally equivalent. If I wanted, could also in theory add support
    for Intel style "[R4+16]" style syntax.



    In other news:
    Was poking around and implemented a simplistic vaguely-MP3-like audio codec.

    General:
    Uses AdRice for the entropy coder;
    Uses Block-Haar as the main transform;
    As 2 levels of an 8-element Haar transform, for a 64-element block.
    Groups of 4 center blocks and 1 side block form a larger 256 sample block;
    Uses a "half-linear cubic spline" for low frequency components;
    Multiple 256 sample blocks are encoded end-to-end into larger blocks
    which are entropy-coded separately;
    A group of headers are re-encoded occasionally, these give general
    features like the encoded sample rate and main quantization tables
    (though, quantization is primarily controlled by a dynamically encoded parameter, which encodes a fixed-point scale for the block encoded
    per-block).
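
A rough sketch of what "2 levels of an 8-element Haar transform, for a 64-element block" could look like. This is a guess at the structure, not the codec's actual code: it assumes an integer, exactly reversible averaging/difference step, and it assumes the second level runs over the eight DC terms produced by the first level. The names `haar8`, `ihaar8`, `block_haar64` and `iblock_haar64` are hypothetical. The spline, quantization and AdRice stages are not modelled.

```python
def haar8(x):
    """Forward integer Haar: returns [dc, 1 coarse diff, 2 mid diffs, 4 fine diffs]."""
    avgs = list(x)
    details = []
    while len(avgs) > 1:
        pairs = [(avgs[i], avgs[i + 1]) for i in range(0, len(avgs), 2)]
        details = [a - b for a, b in pairs] + details
        avgs = [(a + b) >> 1 for a, b in pairs]
    return avgs + details


def ihaar8(c):
    """Exact inverse of haar8."""
    avgs = [c[0]]
    pos = 1
    while pos < len(c):
        diffs = c[pos:pos + len(avgs)]
        pos += len(avgs)
        nxt = []
        for avg, d in zip(avgs, diffs):
            a = avg + ((d + 1) >> 1)
            nxt += [a, a - d]
        avgs = nxt
    return avgs


def block_haar64(samples):
    """Level 1: eight 8-point transforms; level 2 (assumed): Haar over their DCs."""
    rows = [haar8(samples[i:i + 8]) for i in range(0, 64, 8)]
    dcs = haar8([r[0] for r in rows])
    return dcs, [r[1:] for r in rows]


def iblock_haar64(dcs, details):
    """Undo block_haar64."""
    out = []
    for dc, det in zip(ihaar8(dcs), details):
        out += ihaar8([dc] + det)
    return out
```

Because every step is an exact integer lifting step, the round trip is lossless before quantization; any loss in the real codec would come from the quantizer, not from the transform.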

    The audio is encoded relative to a spline, as with just the block-Haar
    by itself, the results sounded kinda awful. Low frequencies resulted in significant blocking artifacts, and blocky stair-stepping sounds pretty
    bad with audio.


    I had set up the spline with the control points aligned with the edges
    of the blocks. This initially made sense, but I have found that sounds
    in a certain frequency range can cause the DC of the block to move significantly relative to the spline (turning them into obvious square waves).


    ...

  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 21 21:10:20 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 20:13:20 +0000, BGB wrote:

    On 10/17/2024 9:28 PM, MitchAlsup1 wrote:

    Granted, I guess it could be decoded as if it were a normal 3RI op or
    similar, but then split up the immediate into multiple parts in EX1.

    Why would you want do make it 3×11-bit displacements when you can
    make it 3×16-bit displacements.

        +------+-----+-----+----------------+
        | Bc   |  3W |  Rt |   .lb_lo       |
        +------+-----+-----+----------------+
        |   .lb_zero       |  .lb_hi        |
        +------------------+----------------+

    Neither BJX2 nor RISC-V have the encoding space to pull this off...
    Even in a clean-slate ISA, it would be a big ask.

    If you remove compressed instructions from RISC-V, you have enough
    room left over to put the entire My 66000 ISA. ... ... ...
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 21 19:38:41 2024
    From Newsgroup: comp.arch

    On 10/21/2024 4:10 PM, MitchAlsup1 wrote:
    On Mon, 21 Oct 2024 20:13:20 +0000, BGB wrote:

    On 10/17/2024 9:28 PM, MitchAlsup1 wrote:

    Granted, I guess it could be decoded as if it were a normal 3RI op or
    similar, but then split up the immediate into multiple parts in EX1.

    Why would you want do make it 3×11-bit displacements when you can
    make it 3×16-bit displacements.

         +------+-----+-----+----------------+
         | Bc   |  3W |  Rt |   .lb_lo       |
         +------+-----+-----+----------------+
         |   .lb_zero       |  .lb_hi        |
         +------------------+----------------+

    Neither BJX2 nor RISC-V have the encoding space to pull this off...
       Even in a clean-slate ISA, it would be a big ask.

    If you remove compressed instructions from RISC-V, you have enough
    room left over to put the entire My 66000 ISA. ... ... ...


    Likewise, could also fit more or less all of XG2 encoding space into the
    space as well, if the bits were shuffled around to fit the encoding
    space around RISC-V...

    I could have considered this, vs my previous BSR4I idea...
    Pro:
    Could potentially leverage my existing BJX2 decoders;
    BSR4I would have needed new decoders.
    Con:
    Possibly a bigger dog-chewed mess than my existing encoding.
    The BJX2 ISA is still a bit more complicated than RV;
    Would still need the resource cost of more decoders.


    Say:
    NMOP-YwYY-nnnn-mmmm ZZZZ-Qnmo-oooo-XXXX (F0)
    NMOP-YwYY-nnnn-mmmm ZZZZ-Qnmo-oooo-oooo (F1/F2)
    NZZP-YwYY-nnnn-ZZZn iiii-iiii-iiii-iiii (F8)

    Possible Repack:
    XXXX-oooo-oomm-mmmm-ZZZZ-nnnn-nnQY-YYPw (F0)
    oooo-oooo-oomm-mmmm-ZZZZ-nnnn-nnQY-YYPw (F1/F2)
    iiii-iiii-iiii-iiii-ZZZZ-nnnn-nnZY-YYPw (F8)
    00: OP?T
    01: OP?F
    10: OP
    11: RV OP32

    If I did so though, would likely:
    Drop FA and FB blocks, and rework the F8 block
    Implicitly, WEX and PrWEX are dropped;
    Would need to use superscalar.
    The FA and FB blocks would take over the Jumbo-Prefix role.

    Likely:
    Special case F8 so that it makes sense;
    Special case F1 and F2 so that immediate bits are contiguous;
    May make sense to relocate BRA and BSR from F0 to F8.
    Likely reduced from 23 to 22 bits.

    Where, YYY:
    000: F0 (3R ops)
    001: F1 (LD/ST Disp10)
    010: F2 (3RI Imm10 Ops)
    011: F3 (Reserved / User)
    100: F8 (Imm16 ops)
    101: F9 (Reserved)
    110: FE (Jumbo Prefix)
    111: FF (Jumbo Prefix)
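
As a rough illustration of how the repacked layout could be picked apart, the sketch below reads a pattern string like the ones above (MSB first, dashes as separators) and gathers the bits for each letter. The names `extract_fields`, `F0_REPACK` and `BLOCKS` are hypothetical; the block names follow the YYY table above. When a letter appears in more than one place, as Z does in the F8 pattern, its bits are simply concatenated MSB first.

```python
def extract_fields(pattern, word):
    """Gather the bits of a 32-bit word into fields named by the pattern letters."""
    letters = pattern.replace("-", "")
    if len(letters) != 32:
        raise ValueError("pattern must describe exactly 32 bits")
    fields = {}
    for i, letter in enumerate(letters):
        bit = (word >> (31 - i)) & 1            # pattern is written MSB first
        fields[letter] = (fields.get(letter, 0) << 1) | bit
    return fields


# Repacked F0 layout as written in the post.
F0_REPACK = "XXXX-oooo-oomm-mmmm-ZZZZ-nnnn-nnQY-YYPw"

# YYY block selector, copied from the table above.
BLOCKS = {
    0b000: "F0", 0b001: "F1", 0b010: "F2", 0b011: "F3",
    0b100: "F8", 0b101: "F9", 0b110: "FE", 0b111: "FF",
}
```

A decoder built this way would look at `Y` first to choose the block and only then interpret the remaining fields, since F8 reuses the upper bits as a 16-bit immediate.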

    Probably using a variation of XG2RV rules (IOW: Uses same register space
    and ABI as RISC-V).



    Ironically, repacking XG2 to fit into the RV encoding space might
    actually be easier than trying to expand RISC-V register fields to 6
    bits and fit it into the same space.

    If doing so, it would likely make sense to only carry over certain
    encoding blocks, say:
    0z-000 -> 000: LD / ST (O select)
    11-000 -> 001: BEQ
    11-000 -> 010: -
    11-000 -> 011: -
    01-100 -> 100: ALU
    01-110 -> 101: ALUW
    10-100 -> 110: FPU
    00-1z0 -> 111: ALUI / ALUIW (O select)

    ZZZZZZZ-ooooo-mmmmm-ZZZ-nnnnn-nm-YYY0o

    Where, say:
    0z: RV, Expanded 6b
    10: -
    11: Original RV OP32


    Or, more aggressive:
    0z-000 -> 00: LD / ST (O select)
    00-1z0 -> 01: ALUI / ALUIW (O select)
    01-100 -> 10: ALU
    01-110 -> 11: ALUW

    ZZZZZZZ-ooooo-mmmmm-ZZZ-nnnnn-YY-nmo00

    Where:
    00: RV, Expanded 6b
    01: -
    10: -
    11: Original RV OP32


    Though, the top 4 blocks of RV is probably less useful than nearly the
    entire XG2 ISA...



    Though, not sure how well "Repacked XG2RV hot glued onto RISC-V" would
    go over.

    Would still have the downside of needing a special/separate operating
    mode. Well, and the wonk that it would still be essentially two ISAs awkwardly glued together.



    But, then again, there seems to still be a roughly 19% performance delta between my current extended RISC-V and XG2 when it comes to running
    Doom. As, sadly, Jumbo Prefixes and Indexed Load/Store were still not
    enough to entirely close the gap.

    Eg:
    XG2 : 25 fps
    RV+J : 21 fps
    RV64G (GCC): 17 fps
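
For reference, the "roughly 19%" figure follows directly from the frame rates listed above when the delta is taken relative to the slower configuration. The helper name `speedup_percent` is hypothetical.

```python
# Frame rates as quoted in the post.
fps = {"XG2": 25, "RV+J": 21, "RV64G (GCC)": 17}


def speedup_percent(fast, slow):
    """Percentage by which `fast` exceeds `slow`, relative to `slow`."""
    return 100.0 * (fps[fast] - fps[slow]) / fps[slow]
```

By the same measure, the jumbo-prefix build sits a bit over 23% ahead of plain RV64G from GCC.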


    Implementation would be easier, in that it would be mostly "take
    existing ISA and shuffle the bits around" on the encoder and decoder sides.


    Some people really like the C extension though, but granted, it makes
    more sense for microcontrollers.

    IME, performance oriented code isn't really limited by I$ miss rate. I$
    misses are a bigger issue with 4K or 8K I$, but much less of an issue
    with 32K I$.


    Well, and also XG2 is currently managing to be smaller than RV64GC as
    well, as fewer instructions is saving more than "common instructions
    using less space" (like, 'C' saves 35%, but avoiding most of the cases
    that need multi-instruction sequences saves 60%, ...).

    Jumbo prefixes and similar help, but would still need to shave off
    another 20% here.


    ...

