From Newsgroup: comp.arch
So, a general spec is here:
https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2024-10-22_XG3RV.txt
So, basically what it is, is a bit-repacked and tweaked version of the
BJX2 ISA, but glued onto the RISC-V ISA (in encoding space reclaimed
from the 'C' extension); creating a sort of hybrid ISA mode.
In the CPU core and emulator, it is handled as a special case of XG2
(mostly gluing on bit-repacking and various special cases).
In BGBCC, it is being handled as a sub-mode of RV64.
In effect, this means new instruction emitter logic and similar.
Thus far, it is mostly operating on an ISA subset.
Most of the functional restructuring that was applied to the RV64G
target is also being applied to XG3 (but, then again, for the RV64
target, the emitter stage was faking a lot of stuff when generating code
for RV64, and with XG3 some amount of this can be handled more directly
as more of the needed instructions "actually exist").
Though, amidst all of this, my older cat has passed on (he was roughly
18). This was a very sad/unpleasant experience, and my emotional state
has not entirely stabilized. But, yeah, can still be sad about things.
He was a long term furry friend, who liked to sit on my keyboard and lay across my chest/shoulder. The world seems a little more empty now.
It is difficult to know how to express myself, as my mind is not always entirely cohesive in these areas.
In effect, XG3 expands the register space back up to 64 GPRs, but
doesn't currently get the full set in BGBCC partly because RV64 cuts off
a few registers, and (indirectly) because the balance of scratch and callee-save registers doesn't match up, so the strategy (in the
compiler) of remapping the XG2 registers to RV equivalents in the
register allocator, doesn't quite work out. This could be made closer,
but would require either more work on the register allocator, or
changing the ABI to reassign some of the extra scratch FPRs over to the
callee save side.
Where, this mode essentially has 24 callee save registers, and 35
scratch registers, vs the BJX2/XG2 ABI being closer to an even split:
31 callee-save + 30 scratch.
Changing the register balance is preferably avoided though, as this is significantly more likely to break interop with code compiled with
GCC (in RV64G mode). Similarly, for now I am sticking with the LP64 ABI (8 arguments via R10..R17).
Theoretically, the extra scratch registers could potentially be useful
in ASM code and leaf functions though.
For now, it is not using any predication, and is instead handling all conditional logic and branches as in RISC-V (namely, using plain compare-and-branch). Current thinking is that predication will be
demoted to optional.
Potentially also, instruction support in XG3RV will be re-aligned to
match up with corresponding RISC-V extensions.
Doom ".text" size stats:
XG2 : 289K
XG3RV : 320K
RV64G+Jumbo : 360K
RV64GC(GCC) : 393K (with 'C' extension)
RV64G(BGBCC) : 438K
RV64G(GCC) : 445K
Doom fps, start of E1M1:
XG2 : 25
XG3RV : 23
RV64G+Jumbo : 20
RV64G(GCC) : 17
RV64G(BGBCC): 12
So, as-is, XG3RV still doesn't quite match XG2 in terms of either code
density or performance, but it is a lot closer.
Not entirely obvious where the delta is, but most likely in edge cases.
Apart from edge cases and predication, currently most of XG2 is
available in XG3.
The difference between 32 and 64 GPRs does make a difference, but
relatively modest. But, XG3RV with 32 GPRs does still do slightly better
(for both code-density and performance) than RV64G+Jumbo.
Using a 64-GPR configuration adds around 2 fps, and shaves ~10K off the binary.
Note that 64-GPR operation via the jumbo-extension generally makes code density and performance worse (but, not exactly a surprise there). So,
to some extent, they are tied together (32 GPR XG3 isn't nearly as
useful, and 64 GPR via jumbo encodings sucks worse than limiting things
to 32 GPRs).
Seems like possibly, some amount of the difference is being made by
having EXTS.B and EXTS.W instructions, vs using pairs of shifts.
EXTU.B, EXTS.L, and EXTU.L have direct analogs in RV64.
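To illustrate the shift-pair point, here is a minimal sketch in C (not taken from the spec). It assumes EXTS.B sign-extends the low 8 bits and EXTS.W the low 16 bits into a 64-bit register; the function names are made up for illustration. The shift-pair versions correspond roughly to an SLLI followed by an SRAI on plain RV64G.

```c
#include <stdint.h>

/* Shift-pair form, as plain RV64G would do it (roughly SLLI by 56,
   then SRAI by 56). The left shift is done on an unsigned value to
   avoid signed-overflow UB; right-shifting a negative int64_t is
   arithmetic on the usual compilers (implementation-defined in ISO C). */
static int64_t exts_b_shift_pair(uint64_t x)
{
    return (int64_t)(x << 56) >> 56;
}

/* What a single EXTS.B-style instruction would compute (assumed). */
static int64_t exts_b_direct(uint64_t x)
{
    return (int64_t)(int8_t)(uint8_t)x;
}

/* Same idea for a 16-bit sign extension (assumed EXTS.W semantics). */
static int64_t exts_w_shift_pair(uint64_t x)
{
    return (int64_t)(x << 48) >> 48;
}

static int64_t exts_w_direct(uint64_t x)
{
    return (int64_t)(int16_t)(uint16_t)x;
}
```

Both forms give the same result; the difference is one instruction versus two in the emitted code.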
Relatively few "novel" instructions are seeing significant use.
Well, along with semi-common cases which are absent in both RV64G and
the B extension.
There are ADDU.L and SUBU.L, which existed in earlier versions of BitManip
as ADDWU and SUBWU (in my efforts, I had re-added them). Apparently,
they had been dropped with the reasoning that ADD+ADD.UW and SUB+ADD.UW
could mimic the behavior. Some functional differences exist between
BJX2/XG3RV and BitManip mostly on the difference that my stuff tends to
assume "unsigned int" being zero-extended, but BitManip seems to
prioritize the assumption of sign-extended "unsigned int" (as does the
RV ABI). There is a little wonk here in that BGBCC tends to assume zero-extended unsigned types.
As I see it though, zero-extended unsigned types have less wonk than sign-extended unsigned types, and there is merit in an
ISA being able to work with "unsigned int" in ways that "don't suck".
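A small C sketch of the semantics in question (illustrative only; the function names are hypothetical). It assumes ADDU.L behaves like the old ADDWU, that is, a 32-bit add with the result zero-extended to 64 bits, and contrasts it with RV64 ADDW (sign-extended result) and with the two-op ADD + ADD.UW replacement, where ADD.UW with a zero register as the second source zero-extends the low 32 bits.

```c
#include <stdint.h>

/* ADDU.L / ADDWU-style: 32-bit add, result zero-extended to 64 bits. */
static uint64_t addu_l(uint64_t a, uint64_t b)
{
    return (uint64_t)(uint32_t)(a + b);
}

/* RV64 ADDW: 32-bit add, result sign-extended to 64 bits. */
static uint64_t addw(uint64_t a, uint64_t b)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)(a + b);
}

/* Two-op replacement: ADD, then ADD.UW against the zero register. */
static uint64_t add_then_zext_w(uint64_t a, uint64_t b)
{
    uint64_t t = a + b;             /* ADD    t, a, b     */
    return (uint64_t)(uint32_t)t;   /* ADD.UW rd, t, zero */
}
```

The single-op and two-op forms agree on the result; ADDW differs whenever bit 31 of the sum is set.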
Some things were dropped in this conversion:
Pretty much all of the 1R encodings, but there are relatively few that
seemed relevant in this case (though there may still be a need for JMP
and JSR with a register, these are currently handled on the RV64 side);
All of the 2RI Imm10 encodings,
The 2RI encodings would either need additional repack twiddling, or to
accept the slightly wonky encodings resulting from the current repacking
rules (as I ended up doing for the 2RI encodings, or deal with it in the
main decoder).
However, I noted when looking at it, that none of the existing 2RI Imm10 encodings were strictly needed in XG3 (and a few of the "more relevant"
cases ended up migrated to the F8 block).
If I were to map over this block as-is, likely instruction format would be:
* akii-iiiiii-bjXXXX-ZZZZ-nnnnnn-QY-YYPw (F2 Imm10)
With the immediate being decoded as:
Imm10s: jcdk-iiii-iiii (12-bit)
Where j is the sign-extension, and c=a^j, d=b^j.
Imm10u: jabk-iiii-iiii (12-bit)
Zero extended, no XOR as a^0=a
Imm10n: jABk-iiii-iiii (12-bit)
One extended, A/B inverted as a^1=!a
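The three decodings above can be sketched in C roughly as follows. This is one reading of the bit patterns, not a definitive decoder: it assumes the fields a, b, j, k and the 8 i bits have already been pulled out of the instruction word, that bit 11 of the 12-bit value is always j, and that a and b are XOR'ed with the extension bit (j for Imm10s, 0 for Imm10u, 1 for Imm10n). The names decode_imm10 and imm10_mode are made up for illustration.

```c
#include <stdint.h>

enum imm10_mode { IMM10S, IMM10U, IMM10N };

/* Rebuild the immediate from the split fields (all field arguments
   are single bits except i8, which is the 8-bit "iiii-iiii" part). */
static int64_t decode_imm10(enum imm10_mode mode,
                            unsigned a, unsigned b, unsigned j,
                            unsigned k, unsigned i8)
{
    /* Extension bit: j for signed, 0 for zero-ext, 1 for one-ext. */
    unsigned ext = (mode == IMM10S) ? (j & 1u)
                 : (mode == IMM10N) ? 1u : 0u;

    uint64_t imm12 = ((uint64_t)(j & 1u) << 11)
                   | ((uint64_t)((a ^ ext) & 1u) << 10)
                   | ((uint64_t)((b ^ ext) & 1u) << 9)
                   | ((uint64_t)(k & 1u) << 8)
                   | (uint64_t)(i8 & 0xFFu);

    if (ext)
        imm12 |= ~(uint64_t)0xFFF;  /* fill the upper bits with ones */

    return (int64_t)imm12;
}
```

For example, under these assumptions Imm10s with j=1, a=0, b=0, k=1 and all i bits set decodes to -1, since a and b get inverted by the XOR with j.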
Would be better to have a consistent linear 12-bit immediate field with
no XORs, but this would require more special cases in the decoding.
I have yet to decide on what I will go with here.
Though, one "cheaper" option would be to break with XG2 here, and merely
reuse the same Imm10 field format as the 3RI encodings, with the 2
remaining bits then reassigned to the opcode field (and still mostly
matches up with Baseline).
Or:
* iiii-iiiiii-XXXXXX-ZZZZ-nnnnnn-QY-YYPw (F2 Imm10)
Either way, don't really want fields with XOR'ed bits in them, as this
was ugly in XG2 (but did preserve backwards compatibility with Baseline encodings).
Mostly ended up adding (among other things):
MOV.{L/Q} Rn, (GBR, Disp16u*{4/8})
MOV.{L/Q} (GBR, Disp16u*{4/8}), Rn
MOVU.L (GBR, Disp16u*4), Rn
Mostly in sub-spaces that are N/E to Baseline.
Where, potentially, a Disp16u is slightly overkill here, but it works
(these mostly deal with global variables, which with my current test
programs can be addressed effectively with around 16K-32K of range, and 256K/512K could be a little overkill). Does at least mean none of the
global variables is out of range (though, OTOH, "LEA.Q {GBR, Disp16u},
Rn" still generally falls short of being able to address all of the
global arrays).
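The 256K/512K figures follow from a 16-bit unsigned displacement scaled by the access size. A minimal C sketch of the range check a compiler might apply (the helper names are hypothetical, and it assumes the displacement is a plain unsigned element index from GBR):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes reachable from GBR with Disp16u * scale
   (scale = 4 for MOV.L, 8 for MOV.Q). */
static uint64_t gbr_disp16u_reach(unsigned scale)
{
    return (uint64_t)65536 * scale;
}

/* Can this byte offset from GBR be encoded directly? It must be
   aligned to the scale and the scaled index must fit in 16 bits. */
static bool gbr_disp16u_fits(uint64_t byte_offset, unsigned scale)
{
    if (byte_offset % scale != 0)
        return false;
    return (byte_offset / scale) <= 0xFFFFu;
}
```

So gbr_disp16u_reach(4) is 262144 (256K) and gbr_disp16u_reach(8) is 524288 (512K).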
I considered but did not add a (PC, Disp16s) LEA:
Generally, most things that have a PC-relative address taken in this way
are out-of-range of a 16-bit displacement (would need around a 20 bit displacement to be useful). More so, as the most common case here is
loading string literals, where the string table is very unlikely to be in-range.
In this case, a jumbo-encoded (PC, Disp33s) was more useful, as
everything which needs PC-relative addressing falls within a 4GB window.
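For a sense of the ranges involved, here is a small C sketch (illustrative only) of a signed-displacement fit check. It assumes byte-granular displacements, so Disp16s reaches +/- 32K, a 20-bit displacement +/- 512K, and Disp33s +/- 4GB; disp_fits_signed is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Does disp fit in a signed field of the given width?
   Valid for widths from 1 to 63 bits. */
static bool disp_fits_signed(int64_t disp, unsigned bits)
{
    int64_t lim = (int64_t)1 << (bits - 1);
    return disp >= -lim && disp < lim;
}
```

A string table sitting, say, 40000 bytes away from the referencing code fails the 16-bit check but passes the 20-bit and 33-bit ones.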
As can be noted, XG3 does drop the use of WEX and instead assumes the
use of superscalar.
At present, superscalar register alias checks are only performed either between pairs of RV64 ops or pairs of XG3 ops.
I had initially tried to do generic logic that could check RV64 and XG3
ops against each other, but FPGA timing was not happy. It was faster to
run RV64/RV64 and XG3/XG3 checks in parallel and then select results
between op types (implicitly not co-issuing RV64 and XG3 ops).
One possibility would be to do 4 sets of checks and then select whichever
results match the instructions present.
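As a rough behavioral model of the current scheme (a sketch in C, not the actual Verilog), the co-issue decision could look like the following. The struct layout and names are hypothetical; it assumes register numbers have already been extracted, with -1 meaning "field unused" (a write to the RISC-V zero register would also be passed as -1). In hardware there would be two format-specific checkers working on the raw fields in parallel, with a select on the op types; here a single function stands in for both.

```c
#include <stdbool.h>

enum op_kind { OP_RV64, OP_XG3 };

struct op_regs {
    enum op_kind kind;
    int rd;    /* destination register, or -1 if none */
    int rs1;   /* source registers, or -1 if unused   */
    int rs2;
};

/* True if the second op reads or rewrites the first op's destination. */
static bool regs_conflict(const struct op_regs *a, const struct op_regs *b)
{
    if (a->rd < 0)
        return false;
    return b->rs1 == a->rd || b->rs2 == a->rd || b->rd == a->rd;
}

/* Only same-type pairs are considered; mixed RV64/XG3 pairs are
   implicitly never co-issued, as described above. */
static bool can_co_issue(const struct op_regs *a, const struct op_regs *b)
{
    if (a->kind != b->kind)
        return false;
    return !regs_conflict(a, b);
}
```

The 4-set variant would add RV64/XG3 and XG3/RV64 checkers and pick the result matching the pair actually present, instead of returning false for mixed pairs.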
Luckily, the repacking pattern for XG3 ops makes superscalar checks
easier. Likely, doing superscalar with XG2 or the BSR4I idea would have
been more complicated by the higher variability in the register fields.
Unlike XG2, the encoding of the register fields is fully normalized
between the encoding blocks.
In the fetch/decode path, there were a few bits added to each
instruction word to distinguish between BJX2/XG2 ops, RISC-V ops, and
repacked XG3 ops. This was because the decoder needs to be able to know
what the original format for the instruction was to decode correctly
(with mixed-mode instruction streams, it being no longer sufficient to
rely solely on the current operating mode).
But, any thoughts?...