• Current/Ongoing Experiment: XG3RV

    From BGB@cr88192@gmail.com to comp.arch on Thu Nov 7 00:43:30 2024
    From Newsgroup: comp.arch

    So, a general spec is here: https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2024-10-22_XG3RV.txt

    Basically, it is a bit-repacked and tweaked version of the BJX2 ISA,
    glued onto the RISC-V ISA (in encoding space reclaimed from the 'C'
    extension), creating a sort of hybrid ISA mode.

    In the CPU core and emulator, it is handled as a special case of XG2
    (mostly gluing on bit-repacking and various special cases).


    In BGBCC, it is being handled as a sub-mode of RV64.
    In effect, this means new instruction emitter logic and similar.
    Thus far, it is mostly operating on an ISA subset.

    Most of the functional restructuring that was applied to the RV64G
    target is also being applied to XG3 (but, then again, for the RV64
    target, the emitter stage was faking a lot of stuff when generating code
    for RV64, and with XG3 some amount of this can be handled more directly
    as more of the needed instructions "actually exist").



    Though, amidst all of this, my older cat has passed on (he was roughly
    18). This was a very sad/unpleasant experience, and my emotional state
    has not entirely stabilized. But, yeah, can still be sad about things.
    He was a long term furry friend, who liked to sit on my keyboard and lay across my chest/shoulder. The world seems a little more empty now.

    It is difficult to know how to express myself, as my mind is not always entirely cohesive in these areas.



    In effect, XG3 expands the register space back up to 64 GPRs, but
    BGBCC doesn't currently get the full set. This is partly because RV64
    cuts off a few registers, and (indirectly) because the balance of
    scratch and callee-save registers doesn't match up, so the compiler's
    strategy of remapping the XG2 registers to RV equivalents in the
    register allocator doesn't quite work out. This could be made closer,
    but would require either more work on the register allocator, or
    changing the ABI to reassign some of the extra scratch FPRs over to the
    callee-save side.

    Where, this mode essentially has 24 callee save registers, and 35
    scratch registers, vs the BJX2/XG2 ABI being closer to an even split:
    31 callee-save + 30 scratch.


    Changing the register balance is preferably avoided though, as this is
    significantly more likely to break interop with code compiled with
    GCC (in RV64G mode). Similarly, for now I am sticking with the LP64 ABI
    (8 arguments via R10..R17).

    Theoretically, the extra scratch registers could potentially be useful
    in ASM code and leaf functions though.



    For now, it is not using any predication, and is instead handling all
    conditional logic and branches as in RISC-V (namely, using plain
    compare-and-branch). Current thinking is that predication will be
    demoted to optional.

    Potentially also, instruction support in XG3RV will be re-aligned to
    match up with corresponding RISC-V extensions.


    Doom ".text" size stats:
    XG2 : 289K
    XG3RV : 320K
    RV64G+Jumbo : 360K
    RV64GC(GCC) : 393K (with 'C' extension)
    RV64G(BGBCC) : 438K
    RV64G(GCC) : 445K

    Doom fps, start of E1M1:
    XG2 : 25
    XG3RV : 23
    RV64G+Jumbo : 20
    RV64G(GCC) : 17
    RV64G(BGBCC): 12


    So, as-is, XG3RV still doesn't quite match XG2 in terms of either code
    density or performance, but it is a lot closer.
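
    To put rough numbers on "a lot closer", here is a small Python sketch
    that expresses each figure from the two tables above as a percentage of
    the XG2 figure. The helper name is made up for illustration.

```python
# Figures copied from the tables above (".text" size in K, fps at start of E1M1).
text_kb = {"XG2": 289, "XG3RV": 320, "RV64G+Jumbo": 360,
           "RV64GC(GCC)": 393, "RV64G(BGBCC)": 438, "RV64G(GCC)": 445}
fps = {"XG2": 25, "XG3RV": 23, "RV64G+Jumbo": 20,
       "RV64G(GCC)": 17, "RV64G(BGBCC)": 12}

def relative_to(table, baseline="XG2"):
    """Return each entry as a percentage of the baseline entry."""
    base = table[baseline]
    return {name: round(100.0 * value / base, 1) for name, value in table.items()}

size_pct = relative_to(text_kb)
fps_pct = relative_to(fps)

for name in text_kb:
    line = f"{name:14s} size {size_pct[name]:6.1f}%"
    if name in fps_pct:
        line += f"  fps {fps_pct[name]:6.1f}%"
    print(line)
```

    By these figures, XG3RV is about 11% larger and 8% slower than XG2,
    while RV64G+Jumbo is about 25% larger and 20% slower.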

    Not entirely obvious where the delta is, but most likely in edge cases.
    Apart from edge cases and predication, currently most of XG2 is
    available in XG3.


    Going from 32 to 64 GPRs does make a difference, but a relatively
    modest one. XG3RV with 32 GPRs does still do slightly better
    (for both code-density and performance) than RV64G+Jumbo.

    Using a 64-GPR configuration adds around 2 fps, and shaves ~ 10K off the binary.


    Note that 64-GPR operation via the jumbo-extension generally makes code
    density and performance worse (but, not exactly a surprise there). So,
    to some extent, they are tied together (32 GPR XG3 isn't nearly as
    useful, and 64 GPR via jumbo encodings sucks worse than limiting things
    to 32 GPRs).



    It seems that some amount of the difference comes from having EXTS.B
    and EXTS.W instructions, vs using pairs of shifts.
    EXTU.B, EXTS.L, and EXTU.L have direct analogs in RV64.
    Relatively few "novel" instructions are seeing significant use.
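
    As a rough illustration of the shift-pair case, here is a small Python
    model of a 64-bit register. It assumes the usual BJX2 size suffixes
    (.B = 8 bits, .W = 16 bits); the helper names are made up for this
    sketch and are not taken from the spec.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """Interpret a 64-bit pattern as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x & (1 << 63) else x

def slli(x, sh):
    """Shift left logical, truncated to 64 bits."""
    return (x << sh) & MASK64

def srai(x, sh):
    """Shift right arithmetic (Python's >> is arithmetic on negative ints)."""
    return to_signed64(x) >> sh

def exts_shift_pair(x, bits):
    """Two-op form: SLLI then SRAI by (64 - bits)."""
    sh = 64 - bits
    return srai(slli(x, sh), sh)

def exts_direct(x, bits):
    """One-op form, as a dedicated EXTS.B / EXTS.W would compute it."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x & (1 << (bits - 1)) else x

for value in (0x7F, 0x80, 0x1234FF80, 0x8000, 0x7FFF):
    assert exts_shift_pair(value, 8) == exts_direct(value, 8)
    assert exts_shift_pair(value, 16) == exts_direct(value, 16)
```

    Both forms give the same result; the difference is one instruction vs
    two in the emitted code.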

    Well, along with semi-common cases which are absent in both RV64G and
    the B extension.

    There are ADDU.L and SUBU.L, which existed in earlier versions of
    BitManip as ADDWU and SUBWU (in my efforts, I had re-added them).
    Apparently, they had been dropped with the reasoning that ADD+ADD.UW
    and SUB+ADD.UW could mimic the behavior. Some functional differences
    exist between BJX2/XG3RV and BitManip, mostly in that my stuff tends to
    assume "unsigned int" being zero-extended, but BitManip seems to
    prioritize the assumption of sign-extended "unsigned int" (as does the
    RV ABI). There is a little wonk here in that BGBCC tends to assume
    zero-extended unsigned types.
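
    A small Python model of the two forms, as I understand them: ADDU.L is
    taken to be a 32-bit add with a zero-extended result, and ADD.UW follows
    the Zba definition (rs2 plus the zero-extended low word of rs1). The
    function names are just labels for this sketch.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

def addu_l(a, b):
    """Single op: 32-bit add, result zero-extended to 64 bits."""
    return (a + b) & MASK32

def add(a, b):
    """Plain 64-bit ADD."""
    return (a + b) & MASK64

def add_uw(rs1, rs2):
    """Zba ADD.UW: rs2 + zero-extended low 32 bits of rs1."""
    return (rs2 + (rs1 & MASK32)) & MASK64

def addu_l_two_op(a, b):
    """ADD followed by ADD.UW against zero, mimicking ADDU.L."""
    return add_uw(add(a, b), 0)

def addw(a, b):
    """RV64 ADDW: 32-bit add, result sign-extended (as a 64-bit pattern)."""
    r = (a + b) & MASK32
    return (r | (MASK64 ^ MASK32)) if r & 0x80000000 else r

for a, b in ((1, 2), (0xFFFFFFFF, 1), (0x7FFFFFFF, 1)):
    assert addu_l(a, b) == addu_l_two_op(a, b)

# The two conventions disagree once bit 31 of the result is set:
print(hex(addu_l(0x7FFFFFFF, 1)))  # zero-extended form
print(hex(addw(0x7FFFFFFF, 1)))    # sign-extended form
```

    The two-op sequence matches the single op, at the cost of an extra
    instruction wherever a zero-extended 32-bit result is wanted.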

    As I see it though, zero-extended unsigned types have less wonk than
    sign-extended unsigned types, and there is merit in an ISA being able
    to work with "unsigned int" in ways that "don't suck".



    Some things were dropped in this conversion:
    Pretty much all of the 1R encodings, but there are relatively few that
    seemed relevant in this case (though there may still be a need for JMP
    and JSR with a register, these are currently handled on the RV64 side);
    All of the 2RI Imm10 encodings.

    The 2RI encodings would either need additional repack twiddling, or to
    accept the slightly wonky encodings resulting from the current repacking
    rules (as I ended up doing for the 2RI encodings, or deal with it in the
    main decoder).

    However, I noted when looking at it, that none of the existing 2RI Imm10 encodings were strictly needed in XG3 (and a few of the "more relevant"
    cases ended up migrated to the F8 block).


    If I were to map over this block as-is, likely instruction format would be:
    * akii-iiiiii-bjXXXX-ZZZZ-nnnnnn-QY-YYPw (F2 Imm10)

    With the immediate being decoded as:
      Imm10s: jcdk-iiii-iiii (12-bit)
        Where j is the sign-extension, and c=a^j, d=b^j.
      Imm10u: jabk-iiii-iiii (12-bit)
        Zero extended, no XOR as a^0=a.
      Imm10n: jABk-iiii-iiii (12-bit)
        One extended, A/B inverted as a^1=!a.
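
    A rough Python sketch of how I read the three cases. It assumes the
    layout string is written most-significant bit first (so a is bit 31,
    k is bit 30, the 8 i bits are 29..22, b is bit 21, j is bit 20), and
    that the Imm10s/Imm10u/Imm10n choice is supplied from elsewhere (here
    as a hypothetical `kind` argument). Neither assumption is stated in the
    text above.

```python
def extract_f2_imm_fields(word):
    """Pull (a, b, j, k, i) out of a 32-bit instruction word.

    Assumed layout, MSB first: akii-iiiiii-bjXXXX-ZZZZ-nnnnnn-QY-YYPw
    """
    a = (word >> 31) & 1
    k = (word >> 30) & 1
    i = (word >> 22) & 0xFF
    b = (word >> 21) & 1
    j = (word >> 20) & 1
    return a, b, j, k, i

def decode_imm10(kind, a, b, j, k, i):
    """Build the 12-bit value and extend it according to `kind`."""
    if kind == "s":      # sign-extended: a and b are XOR'ed with j
        hi2 = ((a ^ j) << 1) | (b ^ j)
        ext = j
    elif kind == "u":    # zero-extended: a and b used as-is
        hi2 = (a << 1) | b
        ext = 0
    elif kind == "n":    # one-extended: a and b inverted
        hi2 = ((a ^ 1) << 1) | (b ^ 1)
        ext = 1
    else:
        raise ValueError("kind must be 's', 'u' or 'n'")
    raw12 = (j << 11) | (hi2 << 9) | (k << 8) | (i & 0xFF)
    # Extending with ones above bit 11 is the same as subtracting 2**12.
    return raw12 - (1 << 12) if ext else raw12

# With a=b=0, Imm10s behaves like a plain 10-bit signed field {j,k,i}:
assert decode_imm10("s", 0, 0, 0, 1, 0xFF) == 511
assert decode_imm10("s", 0, 0, 1, 0, 0x00) == -512
```

    The XOR is what lets encodings with a=b=0 keep their old 10-bit
    meaning while a and b extend the range.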

    Would be better to have a consistent linear 12-bit immediate field with
    no XOR's, but this would require more special cases in the decoding
    (where). I have yet to decide on what I will go with here.

    Though, one "cheaper" option would be to break with XG2 here, and merely
    reuse the same Imm10 field format as the 3RI encodings, with the 2
    remaining bits then reassigned to the opcode field (and still mostly
    matches up with Baseline).


    Or:
    * iiii-iiiiii-XXXXXX-ZZZZ-nnnnnn-QY-YYPw (F2 Imm10)

    Either way, don't really want fields with XOR'ed bits in them, as this
    was ugly in XG2 (but did preserve backwards compatibility with Baseline encodings).



    Mostly ended up adding (among other things):
    MOV.{L/Q} Rn, (GBR, Disp16u*{4/8})
    MOV.{L/Q} (GBR, Disp16u*{4/8}), Rn
    MOVU.L (GBR, Disp16u*4), Rn
    Mostly in sub-spaces that are N/E to Baseline.

    Where, potentially, a Disp16u is slightly overkill here, but it works
    (these mostly deal with global variables, which with my current test
    programs can be addressed effectively with around 16K-32K of range, and 256K/512K could be a little overkill). Does at least mean none of the
    global variables is out of range (though, OTOH, "LEA.Q (GBR, Disp16u),
    Rn" still generally falls short of being able to address all of the
    global arrays).
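
    The 256K/512K figures follow directly from the displacement width and
    scale. A quick Python check (the function name is hypothetical):

```python
def scaled_disp_reach(bits, scale):
    """Bytes reachable from the base with an unsigned displacement of
    `bits` bits, scaled by `scale` bytes per step."""
    return (1 << bits) * scale

KB = 1024
print(scaled_disp_reach(16, 4) // KB, "K for MOV.L  (Disp16u*4)")
print(scaled_disp_reach(16, 8) // KB, "K for MOV.Q  (Disp16u*8)")
print(scaled_disp_reach(16, 1) // KB, "K for an unscaled Disp16u")
```

    The last line assumes the LEA form is byte-scaled, which would explain
    why it covers less than the scaled load/store forms.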


    I considered but did not add a (PC, Disp16s) LEA:
    Generally, most things that have a PC-relative address taken in this way
    are out-of-range of a 16-bit displacement (would need around a 20 bit displacement to be useful). More so, as the most common case here is
    loading string literals, where the string table is very unlikely to be in-range.

    In this case, a jumbo-encoded (PC, Disp33s) was more useful, as
    everything which needs PC-relative addressing falls within a 4GB window.


    As can be noted, XG3 does drop the use of WEX and instead assumes the
    use of superscalar.

    At present, superscalar register alias checks are only performed either between pairs of RV64 ops or pairs of XG3 ops.

    I had initially tried to do generic logic that could check RV64 and XG3
    ops against each other, but FPGA timing was not happy. It was faster to
    run RV64/RV64 and XG3/XG3 checks in parallel and then select results
    between op types (implicitly not co-issuing RV64 and XG3 ops).

    A possible option would be to do 4 sets of checks and then select
    whichever results match the instructions present.
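
    A very loose software model of the "check both, then select" scheme
    described above, in Python. The Op record, the tag names, and the exact
    hazard rule (no read-after-write or write-after-write between the pair)
    are assumptions for illustration; the real logic works on raw
    instruction bits and handles cases this ignores (such as the zero
    register).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

RV64, XG3 = "RV64", "XG3"  # hypothetical tags for the two op types

@dataclass(frozen=True)
class Op:
    kind: str               # RV64 or XG3
    dest: Optional[int]     # destination register, None if nothing is written
    srcs: Tuple[int, ...]   # source registers

def no_alias(first: Op, second: Op) -> bool:
    """True if `second` neither reads nor overwrites the result of `first`."""
    if first.dest is None:
        return True
    return first.dest not in second.srcs and first.dest != second.dest

def can_co_issue(first: Op, second: Op) -> bool:
    # In hardware, both checkers run side by side on the same pair of
    # words, each assuming its own field layout; here one function
    # stands in for both.
    rv_ok = no_alias(first, second)     # stand-in for the RV64/RV64 checker
    xg3_ok = no_alias(first, second)    # stand-in for the XG3/XG3 checker
    if first.kind == RV64 and second.kind == RV64:
        return rv_ok
    if first.kind == XG3 and second.kind == XG3:
        return xg3_ok
    return False  # mixed RV64/XG3 pair: implicitly never co-issued

a = Op(RV64, 5, (1, 2))
b = Op(RV64, 6, (3, 4))
x = Op(XG3, 40, (41, 42))
print(can_co_issue(a, b), can_co_issue(a, x))
```

    The 4-set variant would add RV64/XG3 and XG3/RV64 checkers and replace
    the final `return False` with a select on those results.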



    Luckily, the repacking pattern for XG3 ops makes superscalar checks
    easier. Likely, doing superscalar with XG2 or the BSR4I idea would have
    been more complicated by the higher variability in the register fields.
    Unlike XG2, the encoding of the register fields is fully normalized
    between the encoding blocks.

    In the fetch/decode path, there were a few bits added to each
    instruction word to distinguish between BJX2/XG2 ops, RISC-V ops, and
    repacked XG3 ops. This was because the decoder needs to be able to know
    what the original format for the instruction was to decode correctly
    (with mixed-mode instruction streams, it being no longer sufficient to
    rely solely on the current operating mode).
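
    As a sketch of what such tagging could look like, in Python: standard
    32-bit RISC-V instructions have the low two bits set to 11, and the
    space reclaimed from the 'C' extension is the remaining low-bit
    patterns. The mode names, the enum, and the assumption that every
    non-11 pattern is an XG3 op are mine, for illustration only.

```python
from enum import Enum

class Fmt(Enum):
    XG2 = 0   # native BJX2/XG2 encoding
    RV = 1    # standard 32-bit RISC-V encoding
    XG3 = 2   # repacked XG3 encoding

def tag_word(word: int, mode: str) -> Fmt:
    """Attach a format tag to a fetched 32-bit word.

    `mode` is a hypothetical operating-mode name: "XG2" for the native
    mode, "RV" for the RISC-V mode that may also carry XG3 ops.
    """
    if mode == "XG2":
        return Fmt.XG2
    if (word & 0b11) == 0b11:
        return Fmt.RV        # low bits 11: ordinary 32-bit RISC-V op
    return Fmt.XG3           # low bits 00/01/10: former 'C' space

print(tag_word(0x00000013, "RV"))   # RISC-V NOP (ADDI x0, x0, 0)
```

    Once tagged at fetch, later stages can pick the right field layout per
    word without looking at the operating mode again.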


    But, any thoughts?...

