Forum: War Ensemble BBS

Re: Misc: BGBCC targeting RV64G, initial results...

From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Oct 16 11:59:57 2024

From Newsgroup: comp.arch

On 9/30/24 1:52 AM, MitchAlsup1 wrote:

On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:

[snip]

Another weird concept that came to mind would be providing an
8-bit (e.g.) field that enumerated a set of interesting
conditions.

I use a 64-bit container of conditions

An enumeration of conditions is different from a bitmask of
conditions. An enumeration could support N-way branching in a
single instruction rather than a tree of single bit-condition
branches.

My 66000's compare result has unused space for multiple such
enumerations.

I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch. Relation to zero as well
as an explicit comparison value might be useful for some multi-way
choices.

Lack of density is also a problem for multi-way branches; the
encoding will waste space if multiple enumerated states share a
target.

The concept seemed worth mentioning even if I thought it unlikely
to be practically useful.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 16 11:07:08 2024

From Newsgroup: comp.arch

On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

snip

I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.

That was the function of the arithmetic if statement in original
Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 19:11:02 2024

From Newsgroup: comp.arch

On Wed, 16 Oct 2024 15:59:57 +0000, Paul A. Clayton wrote:

On 9/30/24 1:52 AM, MitchAlsup1 wrote:

On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:

[snip]

Another weird concept that came to mind would be providing an
8-bit (e.g.) field that enumerated a set of interesting
conditions.

I use a 64-bit container of conditions

An enumeration of conditions is different from a bitmask of
conditions. An enumeration could support N-way branching in a
single instruction rather than a tree of single bit-condition
branches.

My 66000's compare result has unused space for multiple such
enumerations.

One can "do" 3-way branching as is:: CMP-BC1-BC2-other

I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch. Relation to zero as well
as an explicit comparison value might be useful for some multi-way
choices.

3-way branches are out of style:: Fortran disinherited them
while IEEE 754 made them need to be 4-way (NaN).
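
To illustrate the 4-way point: with IEEE 754 operands, a compare has a fourth "unordered" outcome whenever either input is NaN, so a pure less/equal/greater split does not cover every case. A minimal Python sketch (the function name `compare4` and the outcome strings are made up for illustration):

```python
import math

def compare4(a: float, b: float) -> str:
    """Classify a floating-point comparison into one of four outcomes."""
    if math.isnan(a) or math.isnan(b):
        return "unordered"  # none of <, ==, > holds when a NaN is involved
    if a < b:
        return "less"
    if a > b:
        return "greater"
    return "equal"
```

A 3-way branch on such a compare would need some convention for where the unordered case goes.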

Lack of density is also a problem for multi-way branches; the
encoding will waste space if multiple enumerated states share a
target.

The concept seemed worth mentioning even if I thought it unlikely
to be practically useful.


From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 16 19:17:17 2024

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

snip

I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.

That was the function of the arithmetic if statement in original
Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.

Not so long ago, actually; it was only deleted in Fortran 2018.
I think that this is a bad idea: compilers will continue
to support such features, but possible interactions with other
features will no longer be properly defined.

From BGB@cr88192@gmail.com to comp.arch on Wed Oct 16 15:23:08 2024

From Newsgroup: comp.arch

On 10/16/2024 1:07 PM, Stephen Fuld wrote:

On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

snip

I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.

That was the function of the arithmetic if statement in original
Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.

Yeah...

Ironically, one of the main arguable use-cases for old Fortran style IF
statements is implementing the binary dispatch logic in a binary
subdivided "switch()", but not enough to justify having a dedicated
instruction for it.

Say:
   MOV Imm, Rt //pivot case
   BLT Rt, Rx, .lbl_lo
   BGT Rt, Rx, .lbl_hi
   BRA .lbl_case

But, absent having multiple labels per branch, not really a good way to
save much over this...
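
As a rough sketch of what the binary-subdivided dispatch looks like at a higher level (Python here, with a hypothetical `dispatch` helper and a made-up case table; the real thing is compiler-generated branch code, not a loop), each step is one three-way compare against a pivot case:

```python
def dispatch(cases, x, default=None):
    """Binary-subdivided switch over a sorted list of (value, label) pairs.

    Returns (label, number_of_pivot_compares). Each loop iteration
    corresponds roughly to one MOV/BLT/BGT/BRA group.
    """
    lo, hi = 0, len(cases) - 1
    compares = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        pivot, label = cases[mid]
        compares += 1
        if x < pivot:    # would branch to .lbl_lo
            hi = mid - 1
        elif x > pivot:  # would branch to .lbl_hi
            lo = mid + 1
        else:            # would branch to .lbl_case
            return label, compares
    return default, compares
```

For N sparse case values this needs about log2(N) pivot compares, which is why a single-instruction 3-way branch looks tempting here.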

Otherwise, I have recently still been working on BGBCC+RV stuff:
trying to get things working correctly in my Verilog implementation.
There are still some bugs here.

Writing a spec for a "low-cost" FPU SIMD extension:
https://pastebin.com/9UeAP9Yk

Which basically just takes the arguably cheaper route of "extend the F,
D, and Zfh extensions to support basic FPU-SIMD in the existing FPRs"
rather than "define a whole new complicated mess of stuff" that is the V extension.

Some details are still in-flux, and I have not yet decided whether or
not to map over the FP8 converter ops and similar. Arguably FP8 and
A-Law converter ops are a bit niche though.

As well as looking some at the P spec, which (ignoring the needlessly complicated parts) isn't too far from what BJX2 does SIMD wise (albeit
lacks obvious direct equivalents of the RGB555 helper instructions; but possibly using SIMD to work with RGB555 pixel data is a bit niche).

It is possible that, if I add some of this, I may do it as jumbo-prefix-only
ops. One is unlikely to see RGB555 or FP8 converters used in any
significant density (except maybe if doing highly-unrolled NN code using
FP8 or similar; but it is unclear whether it would make sense to map this
over to RV anyways, and existing people trying to do stuff in this area
appear to be mostly focused on the V extension).

For normal graphical or audio processing, having these sorts of niche converters as 64-bit encodings would probably be fine.

As-is, it could do a 4x32 shuffle in 2 instructions, but would need
either a 4-op sequence (no jumbo), or a jumbo-encoded op, to perform a
4x16 shuffle (it is either this or define a dedicated "FPSHUF.H"
instruction or similar). Can probably assume, if it matters, will
probably also have a jumbo prefix.

May still need to decide on some other things, like whether to map over
a jumbo-encoded 4xFP8 to 4xFP16 constant-load. Or, whether to come up
with an encoding to load an arbitrary 64-bit value into an FPR
(currently N/E in RV64 mode).

As-is:
  J22+J22+LUI  : LI Xn, Imm64
  J22+J22+AUIPC: Unused
  J22+J22+JAL  : Unused, Possible "JAL Rn, Abs64"

For FPRs, it may make sense to have:
  Load Binary16, expanding to Binary64 (already in Jumbo spec)
  Load Imm33s into low-order bits (Jumbo spec, J12O+LUI)
  Load Imm32 into high-order bits
    Possible, not yet defined, already exists in BJX2 (*1).
  Load Imm32 as 2xFP16 expanding to 2xFP32
    Possible, not yet defined, already exists in BJX2 (*1).
  Load Imm32 as 4xFP8 expanding to 4xFP16
    Possible, not yet defined, already exists in BJX2 (*1).

*1: Probably could define it as J12O+LUI, using the Wm and Wo
register-extension bits to encode which type of constant to load
(basically about the same as how I did it in BJX2; just it had used a
J_OP and "MOV Imm16u, Rn" instruction instead, but similar basic idea here).
Probably, say:
  00: Load Imm33s to low 32-bits, sign-extend as usual
  01: Load Imm32 to high 32-bits (sign bit used for LSB fill, *2)
  10: 2xFP16 -> 2xFP32
  11: 4xFP8 -> 4xFP16

*2: Though in BJX2, this case was encoded as J_IMM+"LDIHI Imm10, Rn"
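
A rough Python model of what the four constant-load modes above might produce in a 64-bit register. This is only a sketch under assumptions: the function name `load_const` is hypothetical, "LSB fill" is read as "replicate the sign bit into the low 32 bits", the two FP16 halves are assumed to map low-to-low and high-to-high, and mode 11 is left out because the FP8 layout is not given here.

```python
import struct

MASK64 = (1 << 64) - 1

def load_const(mode: int, imm33: int) -> int:
    """Return the 64-bit register image for a 33-bit immediate field."""
    imm32 = imm33 & 0xFFFFFFFF
    sign = (imm33 >> 32) & 1  # bit 32 is the sign bit of Imm33s
    if mode == 0b00:
        # Sign-extend the 33-bit immediate to 64 bits.
        return (imm32 - (sign << 32)) & MASK64
    if mode == 0b01:
        # Imm32 goes to the high half; low half filled with the sign bit
        # (assumed meaning of "sign bit used for LSB fill").
        fill = 0xFFFFFFFF if sign else 0
        return (imm32 << 32) | fill
    if mode == 0b10:
        # Two Binary16 values expanded to two Binary32 values.
        lo16, hi16 = struct.unpack("<2e", struct.pack("<I", imm32))
        return struct.unpack("<Q", struct.pack("<2f", lo16, hi16))[0]
    raise NotImplementedError("4xFP8 -> 4xFP16: FP8 layout not specified")
```

For example, `load_const(0b01, 0x3FF00000)` gives `0x3FF0000000000000`, the Binary64 encoding of 1.0, which is the sort of case where loading only the high 32 bits is enough.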

Could maybe be tempted to reclaim "J22+J22+AUIPC" as:
LI Fn, Imm64
Arguing that, if one needs PC-rel, +/- 4GB is sufficient; and one is far
more likely to want to be able to load constants like M_PI and similar
into an FPU register (in a single clock-cycle).

Though, if one has this, the other constant cases (2xFP16 or 4xFP8)
would be merely space-saving (mostly relevant to FP-SIMD vector
literals), but may be lower priority mostly as they are infrequently
used (and thus the space savings are less significant).

Relative cost-difference is small, if one assumes an implementation
where the constant-load cases use the same converters as used for the
normal vector conversion path, which would be (presumably) already present.

Most of this would be largely irrelevant to Doom performance, but would
be relevant if I want to try to make GLQuake work at some semblance of
usable in RV Mode.

Less immediate relevance to SW Quake, which uses mostly scalar FPU (and
mostly naively represents vectors as in-memory pointers).

In this stuff, I have also started running into the annoyance of tracking
differences (additions, removals, changes) between different versions of
the BitManip spec / B extension. A few useful ops were removed in newer versions, ...

My Jumbo prefix encoding would have conflicted with an earlier version
of BitManip, but does not conflict with the current form of the B
extension (it exists in the shadow of previously-removed instructions).

Felt curious and looked, it looks like the person mostly responsible for
the B extension has largely "gone quiet" for the past year or so (no
recent social media posts, has seemingly taken down all of their past
YouTube and Twitch contents; minimal activity on GitHub). Not entirely
sure what is going on there.

...

Otherwise, did see a video talking some about performance of Doom and
Quake and similar on older systems:
Doom apparently required something like a 486 DX2-66 to perform well.
Quake apparently required a faster Pentium system to be playable.
Apparently, likewise for Hexen, ...
Apparently Wolf3D needed a higher-end 386 to perform well.
Even if it could technically run on a 286.
...

I guess this differs from my prior understanding that Doom would have
been mostly playable on a 25 MHz 386 or similar. Apparently, not really.

So, I guess I can feel not quite as bad about the lackluster framerates
from Quake and Hexen on a 50MHz core. Seemingly, it is in-fact still outperforming vintage (early 90s) PCs.

Well, and Quake3 is pretty slow, but IIRC, PCs of that era were
generally pushing 1GHz, so...

...


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 22:16:29 2024

From Newsgroup: comp.arch

On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:

Ironically, one of the main arguable use-cases for old Fortran style IF statements is implementing the binary dispatch logic in a binary
subdivided "switch()", but not enough to justify having a dedicated instruction for it.

Say:
MOV Imm, Rt //pivot case
BLT Rt, Rx, .lbl_lo
BGT Rt, Rx, .lbl_hi
BRA .lbl_case

With a 64-bit instruction one could do::

B3W .lbl_lo,.lbl_zero,.lbl_hi

rather straightforwardly.....

From BGB@cr88192@gmail.com to comp.arch on Wed Oct 16 19:03:26 2024

From Newsgroup: comp.arch

On 10/16/2024 5:16 PM, MitchAlsup1 wrote:

On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:

Ironically, one of the main arguable use-cases for old Fortran style IF
statements is implementing the binary dispatch logic in a binary
subdivided "switch()", but not enough to justify having a dedicated
instruction for it.

Say:
   MOV Imm, Rt //pivot case
   BLT Rt, Rx, .lbl_lo
   BGT Rt, Rx, .lbl_hi
   BRA .lbl_case

With a 64-bit instruction one could do::

    B3W   .lbl_lo,.lbl_zero,.lbl_hi

rather straightforwardly.....

Possibly, but the harder part would be to deal with decoding and feeding
the instruction through the pipeline.

Granted, I guess it could be decoded as if it were a normal 3RI op or
similar, but then split up the immediate into multiple parts in EX1.

Say:
Decode as a 3RI Imm33s;
Then split the immediate into 3x 11-bits, calculate 3 offsets relative
to PC, and apply the one which matches the result of the comparison
(likely needing to route the S and Z flags from the subtract logic to
EX1 or similar; vs the current logic routing the CMP T/F flag).

Could deal with the Branch PC as, say:
Calculate PC[47:16]+1, and PC[47:16]-1.
Calculate the low 16 bits of each branch direction;
Select direction based on branch result;
Select high bits of PC based on selected branch direction (-1, 0, 1).
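
Roughly, in Python (a sketch only: the field order within the immediate, the displacement scale of 4 bytes, and all function names are assumptions, not anything defined in an actual spec):

```python
def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def split_imm33(imm33: int):
    """Split Imm33 into three signed 11-bit displacements.
    Assumed order from the LSB upward: low, equal, high."""
    return sext(imm33, 11), sext(imm33 >> 11, 11), sext(imm33 >> 22, 11)

def branch3_target(pc: int, imm33: int, rn: int, rm: int, scale: int = 4) -> int:
    """Pick the displacement matching the compare result and apply it to PC."""
    d_lo, d_eq, d_hi = split_imm33(imm33)
    disp = d_lo if rn < rm else d_hi if rn > rm else d_eq
    return pc + disp * scale

def add_disp_split(pc: int, disp: int) -> int:
    """PC + disp using only a 16-bit low add, then choosing among
    PC[47:16]-1, PC[47:16], PC[47:16]+1 based on borrow/carry.
    disp is assumed to fit in a signed 16-bit range."""
    hi = pc >> 16
    low_sum = (pc & 0xFFFF) + disp
    if low_sum < 0:
        hi_sel = hi - 1   # borrow out of the low 16 bits
    elif low_sum > 0xFFFF:
        hi_sel = hi + 1   # carry out of the low 16 bits
    else:
        hi_sel = hi
    return ((hi_sel << 16) | (low_sum & 0xFFFF)) & ((1 << 48) - 1)
```

The second helper is the reason for pre-computing PC[47:16]+1 and PC[47:16]-1: the wide add is replaced by a narrow add plus a 3-input select.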

But, worth the cost?...
This could mostly benefit programs that spend a significant part of
their running time dispatching in sparse switch blocks, but probably not
a lot else.

Disp11 couldn't deal with particularly large switch blocks, one might
need a 96 bit encoding, possibly using 18 bits each, but this would be
more expensive to deal with.

Or, 2-way with fall-through:
Rn>Rm: Branch High
Rn<Rm: Branch Low
Rm==Rn: Fall Through / No Branch
The fall-through case having a branch to the case label. This would
allow 16 (2-way) and 20/23 bit displacements (for a plain JAL/BRA), so
could deal with much bigger "switch()" blocks.

Would still need to think on this...


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Oct 18 02:28:45 2024

From Newsgroup: comp.arch

On Thu, 17 Oct 2024 0:03:26 +0000, BGB wrote:

On 10/16/2024 5:16 PM, MitchAlsup1 wrote:

On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:

Ironically, one of the main arguable use-cases for old Fortran style IF
statements is implementing the binary dispatch logic in a binary
subdivided "switch()", but not enough to justify having a dedicated
instruction for it.

Say:
   MOV Imm, Rt //pivot case
   BLT Rt, Rx, .lbl_lo
   BGT Rt, Rx, .lbl_hi
   BRA .lbl_case

With a 64-bit instruction one could do::

    B3W   .lbl_lo,.lbl_zero,.lbl_hi

rather straightforwardly.....

Possibly, but the harder part would be to deal with decoding and feeding
the instruction through the pipeline.

Feed the 3×15-bit displacements to the branch unit. When the condition resolves, use one of the 2 selected displacements as the target address.

Granted, I guess it could be decoded as if it were a normal 3RI op or similar, but then split up the immediate into multiple parts in EX1.

Why would you want to make it 3×11-bit displacements when you can
make it 3×16-bit displacements?

+------+-----+-----+----------------+
| Bc   | 3W  | Rt  |    .lb_lo      |
+------+-----+-----+----------------+
|    .lb_zero      |    .lb_hi      |
+------------------+----------------+
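
A small Python sketch of packing and unpacking such a two-word format. The field widths for the first three fields (6, 5, and 5 bits) are only guessed from the proportions of the diagram, and the names `pack_b3w` / `unpack_b3w` are hypothetical; only the three 16-bit displacements come from the text.

```python
def sext16(v: int) -> int:
    """Sign-extend a 16-bit field."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def pack_b3w(op: int, sub: int, rt: int, d_lo: int, d_zero: int, d_hi: int):
    """Pack into two 32-bit words: [op:6 | sub:5 | rt:5 | lo:16], [zero:16 | hi:16]."""
    w0 = ((op & 0x3F) << 26) | ((sub & 0x1F) << 21) | ((rt & 0x1F) << 16) | (d_lo & 0xFFFF)
    w1 = ((d_zero & 0xFFFF) << 16) | (d_hi & 0xFFFF)
    return w0, w1

def unpack_b3w(w0: int, w1: int):
    """Inverse of pack_b3w; displacements come back sign-extended."""
    return (
        (w0 >> 26) & 0x3F,
        (w0 >> 21) & 0x1F,
        (w0 >> 16) & 0x1F,
        sext16(w0),
        sext16(w1 >> 16),
        sext16(w1),
    )
```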

From BGB@cr88192@gmail.com to comp.arch on Mon Oct 21 15:13:20 2024

From Newsgroup: comp.arch

On 10/17/2024 9:28 PM, MitchAlsup1 wrote:

On Thu, 17 Oct 2024 0:03:26 +0000, BGB wrote:

On 10/16/2024 5:16 PM, MitchAlsup1 wrote:

On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:

Ironically, one of the main arguable use-cases for old Fortran style IF
statements is implementing the binary dispatch logic in a binary
subdivided "switch()", but not enough to justify having a dedicated
instruction for it.

Say:
   MOV Imm, Rt //pivot case
   BLT Rt, Rx, .lbl_lo
   BGT Rt, Rx, .lbl_hi
   BRA .lbl_case

With a 64-bit instruction one could do::

     B3W   .lbl_lo,.lbl_zero,.lbl_hi

rather straightforwardly.....

Possibly, but the harder part would be to deal with decoding and feeding
the instruction through the pipeline.

Feed the 3×15-bit displacements to the branch unit. When the condition resolves, use one of the 2 selected displacements as the target address.

No dedicated "branch unit" in my case.

Generally, non-predicted branching is handled by using the AGU to
generate the address, as in a memory load, but then signaling that a
branch should be initiated (in the EX1 stage's glue logic).

Generating a 3-way branch does not map to the AGU though.

One downside of such a branch is that it would also not mix with my
existing branch predictor logic, which thus far is built around a state machine of taken vs non-taken, so would likely ignore a 3-way branch
(making it potentially slower than multiple conventional branches).

Granted, I guess it could be decoded as if it were a normal 3RI op or
similar, but then split up the immediate into multiple parts in EX1.

Why would you want to make it 3×11-bit displacements when you can
make it 3×16-bit displacements?

    +------+-----+-----+----------------+
    | Bc   | 3W  | Rt  |    .lb_lo      |
    +------+-----+-----+----------------+
    |    .lb_zero      |    .lb_hi      |
    +------------------+----------------+

Neither BJX2 nor RISC-V have the encoding space to pull this off...
Even in a clean-slate ISA, it would be a big ask.

Could be possible though in both, via a 96 bit encoding.

Likely, a 2-way with fall-through on equal might make more sense:
Cheaper to implement;
If it falls through, one has already found the target case.

But, yeah, 3x 11b isn't super useful, 2x 16b could be more useful.
But, still wouldn't play with the branch-predictor.

FWIW: Actually I went with the current jumbo prefix encoding rather than
the official 64-bit instruction encoding scheme for my RV64 ext because, ironically, the route I went would eat less of the encoding space.

Working more on BGBCC's RV64 support, I have recently ended up adding a
mode to mimic native RISC-V ASM syntax. Ended up mostly relying on
mnemonics to try to detect whether to use "Rd, Rs1, Rs2" vs "Rs1, Rs2,
Rd" ordering.

Some things are a little wonky in the assembler, as the way BGBCC had
been doing things and the way RV ASM specifies things don't always
match up strictly 1:1.

Ended up using mnemonics:
  First thing on the line, so easy to parse;
  One of the biggest points of divergence between native RV and what BGBCC
  had been using (there weren't really enough syntactic differences to rely
  on those to tell them apart).

The assembler basically counts them up, and whichever side has more
votes for it wins in terms of operand ordering.
Say:
  LD X10, 16(X2)       //will vote for Rd first ordering
  MOV.Q (SP, 16), R10  //will vote for Rd last ordering.
  LI X11, 1234         //will vote for Rd first ordering
  MOV 1234, R11        //will vote for Rd last ordering.
  MV X12, X10          //will vote for Rd first ordering
  MOV R10, R12         //will vote for Rd last ordering.
  ...

Names that are shared in both styles have no vote either way.
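
As a rough illustration of the voting idea (Python, not the actual BGBCC code; the mnemonic sets only contain the examples mentioned in this post, and falling back to a default on a tie is an assumption):

```python
# Mnemonics taken from the examples in this post; a real table would be larger.
RD_FIRST = {"LD", "LI", "MV", "RET", "JALR"}   # native RISC-V style
RD_LAST = {"MOV.Q", "MOV", "RTS", "JMP"}       # BGBCC style

def detect_operand_order(asm_blob: str, default: str = "rd_last") -> str:
    """Count votes per ASM blob and return 'rd_first' or 'rd_last'."""
    first = last = 0
    for line in asm_blob.splitlines():
        code = line.split("//", 1)[0].strip()  # drop trailing comments
        if not code:
            continue
        mnemonic = code.split()[0].upper()
        if mnemonic in RD_FIRST:
            first += 1
        elif mnemonic in RD_LAST:
            last += 1
        # Mnemonics shared by both styles (ADD, ...) cast no vote.
    if first > last:
        return "rd_first"
    if last > first:
        return "rd_last"
    return default  # tie handling is an assumption
```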

Stuff will not necessarily work as intended if one goes mix-and-match
with the ASM styles (it is determined per ASM blob, not per line).

Potentially, one could have ASM blobs too simple to be unambiguous, though:
RET and JALR vote for Rd first;
RTS and JMP vote for Rd last.
So, theoretically, even the simplest inline ASM function should be
unambiguous (and one isn't going to use inline ASM just to specify a
single ADD instruction or similar...).

For LW and SW, both are parsed as-if they were loads, but SW and similar
have gotten new ID numbers, so if one tries to do a Load with one of the
Store IDs, it bounces it over to the Store path in the instruction
emitter logic. This is a little wonky, but alas (was either this or add
wonky special case logic in the ASM parsing).

The main alternative would have been to add assembler directives to
indicate the operand ordering more explicitly (at which point one could
go mix-and-match with the ASM styles if they wanted, provided directives
were used).

Some operand lists are only valid in certain modes though:
  ADD Rs, Imm, Rn //only valid if Rd last
  ADD Rn, Rs, Imm //only valid if Rd first
Though, these cases don't count in the vote as they would have required
more involved parsing. These could be used as "keys" though, as ASM
parsing would fail (resulting in a compiler error) if in the wrong mode.

Note that in the ASM parsing "(R4, 16)" and "16(R4)" are considered
functionally equivalent. If I wanted, I could also in theory add support
for Intel-style "[R4+16]" syntax.
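
A minimal Python sketch of normalizing these operand forms to a (base register, displacement) pair. This is not the BGBCC parser; the regexes and the `parse_mem_operand` name are made up for illustration, and the bracket form is the hypothetical Intel-style variant.

```python
import re

_PAREN = re.compile(r"^\(\s*(\w+)\s*,\s*(-?\w+)\s*\)$")     # (R4, 16)
_DISP = re.compile(r"^(-?\w+)\s*\(\s*(\w+)\s*\)$")          # 16(R4)
_BRACK = re.compile(r"^\[\s*(\w+)\s*([+-])\s*(\w+)\s*\]$")  # [R4+16]

def parse_mem_operand(text: str):
    """Return (base_register, displacement) for any of the three forms."""
    text = text.strip()
    m = _PAREN.match(text)
    if m:
        return m.group(1), int(m.group(2), 0)
    m = _DISP.match(text)
    if m:
        return m.group(2), int(m.group(1), 0)
    m = _BRACK.match(text)
    if m:
        disp = int(m.group(3), 0)
        return m.group(1), -disp if m.group(2) == "-" else disp
    raise ValueError(f"not a memory operand: {text!r}")
```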

In other news:
Was poking around and implemented a simplistic vaguely-MP3-like audio codec.

General:
  Uses AdRice for the entropy coder;
  Uses Block-Haar as the main transform;
    As 2 levels of an 8-element Haar transform, for a 64-element block.
  Groups of 4 center blocks and 1 side block form a larger 256 sample block;
  Uses a "half-linear cubic spline" for low frequency components;
  Multiple 256 sample blocks are encoded end-to-end into larger blocks
  which are entropy-coded separately;
  A group of headers are re-encoded occasionally, these give general
  features like the encoded sample rate and main quantization tables
  (though, quantization is primarily controlled by a dynamically encoded
  parameter, which gives a fixed-point scale and is encoded per-block).
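
One possible reading of the "2 levels of an 8-element Haar" structure, sketched in Python. This is an assumption-heavy illustration and not the codec's actual code: it uses a reversible integer Haar step (average plus difference), transforms each group of 8 samples, then runs a second 8-point Haar over the 8 resulting DC terms. All function names are hypothetical.

```python
def haar8(x):
    """Forward 8-point integer Haar; output is [dc, then detail terms]."""
    out = list(x)
    n = 8
    while n > 1:
        half = n // 2
        s = [(out[2 * i] + out[2 * i + 1]) >> 1 for i in range(half)]
        d = [out[2 * i] - out[2 * i + 1] for i in range(half)]
        out[:n] = s + d
        n = half
    return out

def ihaar8(c):
    """Exact inverse of haar8."""
    out = list(c)
    n = 2
    while n <= 8:
        half = n // 2
        rec = []
        for s, d in zip(out[:half], out[half:n]):
            a = s + ((d + 1) >> 1)  # undo the floor in the average
            rec += [a, a - d]
        out[:n] = rec
        n *= 2
    return out

def haar64(block):
    """First level: 8 groups of 8 samples. Second level: Haar over the 8 DCs."""
    groups = [haar8(block[i:i + 8]) for i in range(0, 64, 8)]
    dcs = haar8([g[0] for g in groups])
    return dcs, [g[1:] for g in groups]

def ihaar64(dcs, details):
    """Inverse of haar64."""
    out = []
    for dc, det in zip(ihaar8(dcs), details):
        out += ihaar8([dc] + det)
    return out
```

In this form the transform itself is lossless; any loss would come from quantizing the coefficients before entropy coding.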

The audio is encoded relative to a spline, as with just the block-Haar
by itself, the results sounded kinda awful. Low frequencies resulted in significant blocking artifacts, and blocky stair-stepping sounds pretty
bad with audio.

I had set up the spline with the control points aligned with the edges
of the blocks. This initially made sense, but I have found that sounds
in a certain frequency range can cause the DC of the block to move
significantly relative to the spline (turning them into obvious square waves).

...


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 21 21:10:20 2024

From Newsgroup: comp.arch

On Mon, 21 Oct 2024 20:13:20 +0000, BGB wrote:

On 10/17/2024 9:28 PM, MitchAlsup1 wrote:

Granted, I guess it could be decoded as if it were a normal 3RI op or
similar, but then split up the immediate into multiple parts in EX1.

Why would you want to make it 3×11-bit displacements when you can
make it 3×16-bit displacements?

    +------+-----+-----+----------------+
    | Bc   | 3W  | Rt  |    .lb_lo      |
    +------+-----+-----+----------------+
    |    .lb_zero      |    .lb_hi      |
    +------------------+----------------+

Neither BJX2 nor RISC-V have the encoding space to pull this off...
Even in a clean-slate ISA, it would be a big ask.

If you remove compressed instructions from RISC-V, you have enough
room left over to put the entire My 66000 ISA. ... ... ...

From BGB@cr88192@gmail.com to comp.arch on Mon Oct 21 19:38:41 2024

From Newsgroup: comp.arch

On 10/21/2024 4:10 PM, MitchAlsup1 wrote:

On Mon, 21 Oct 2024 20:13:20 +0000, BGB wrote:

On 10/17/2024 9:28 PM, MitchAlsup1 wrote:

Granted, I guess it could be decoded as if it were a normal 3RI op or
similar, but then split up the immediate into multiple parts in EX1.

Why would you want to make it 3×11-bit displacements when you can
make it 3×16-bit displacements?

     +------+-----+-----+----------------+
     | Bc   | 3W  | Rt  |    .lb_lo      |
     +------+-----+-----+----------------+
     |    .lb_zero      |    .lb_hi      |
     +------------------+----------------+

Neither BJX2 nor RISC-V have the encoding space to pull this off...
   Even in a clean-slate ISA, it would be a big ask.

If you remove compressed instructions from RISC-V, you have enough
room left over to put the entire My 66000 ISA. ... ... ...

Likewise, could also fit more or less all of XG2 encoding space into the
space as well, if the bits were shuffled around to fit the encoding
space around RISC-V...

I could have considered this, vs my previous BSR4I idea...
Pro:
Could potentially leverage my existing BJX2 decoders;
BSR4I would have needed new decoders.
Con:
Possibly a bigger dog-chewed mess than my existing encoding.
The BJX2 ISA is still a bit more complicated than RV;
Would still need the resource cost of more decoders.

Say:
NMOP-YwYY-nnnn-mmmm ZZZZ-Qnmo-oooo-XXXX (F0)
NMOP-YwYY-nnnn-mmmm ZZZZ-Qnmo-oooo-oooo (F1/F2)
NZZP-YwYY-nnnn-ZZZn iiii-iiii-iiii-iiii (F8)

Possible Repack:
XXXX-oooo-oomm-mmmm-ZZZZ-nnnn-nnQY-YYPw (F0)
oooo-oooo-oomm-mmmm-ZZZZ-nnnn-nnQY-YYPw (F1/F2)
iiii-iiii-iiii-iiii-ZZZZ-nnnn-nnZY-YYPw (F8)
00: OP?T
01: OP?F
10: OP
11: RV OP32

If I did so though, would likely:
Drop FA and FB blocks, and rework the F8 block
Implicitly, WEX and PrWEX are dropped;
Would need to use superscalar.
The FA and FB blocks would take over the Jumbo-Prefix role.

Likely:
Special case F8 so that it makes sense;
Special case F1 and F2 so that immediate bits are contiguous;
May make sense to relocate BRA and BSR from F0 to F8.
Likely reduced from 23 to 22 bits.

Where, YYY:
000: F0 (3R ops)
001: F1 (LD/ST Disp10)
010: F2 (3RI Imm10 Ops)
011: F3 (Reserved / User)
100: F8 (Imm16 ops)
101: F9 (Reserved)
110: FE (Jumbo Prefix)
111: FF (Jumbo Prefix)

Probably using a variation of XG2RV rules (IOW: Uses same register space
and ABI as RISC-V).

Ironically, repacking XG2 to fit into the RV encoding space might
actually be easier than trying to expand RISC-V register fields to 6
bits and fit it into the same space.

If doing so, it would likely make sense to only carry over certain
encoding blocks, say:
0z-000 -> 000: LD / ST (O select)
11-000 -> 001: BEQ
11-000 -> 010: -
11-000 -> 011: -
01-100 -> 100: ALU
01-110 -> 101: ALUW
10-100 -> 110: FPU
00-1z0 -> 111: ALUI / ALUIW (O select)

ZZZZZZZ-ooooo-mmmmm-ZZZ-nnnnn-nm-YYY0o

Where, say:
0z: RV, Expanded 6b
10: -
11: Original RV OP32

Or, more aggressive:
0z-000 -> 00: LD / ST (O select)
00-1z0 -> 01: ALUI / ALUIW (O select)
01-100 -> 10: ALU
01-110 -> 11: ALUW

ZZZZZZZ-ooooo-mmmmm-ZZZ-nnnnn-YY-nmo00

Where:
00: RV, Expanded 6b
01: -
10: -
11: Original RV OP32

Though, the top 4 blocks of RV is probably less useful than nearly the
entire XG2 ISA...

Though, not sure how well "Repacked XG2RV hot glued onto RISC-V" would
go over.

Would still have the downside of needing a special/separate operating
mode. Well, and the wonk that it would still be essentially two ISAs
awkwardly glued together.

But, then again, there seems to still be a roughly 19% performance delta between my current extended RISC-V and XG2 when it comes to running
Doom. As, sadly, Jumbo Prefixes and Indexed Load/Store were still not
enough to entirely close the gap.

Eg:
XG2 : 25 fps
RV+J : 21 fps
RV64G (GCC): 17 fps

Implementation would be easier, in that it would be mostly "take
existing ISA and shuffle the bits around" on the encoder and decoder sides.

Some people really like the C extension though, but granted, it makes
more sense for microcontrollers.

IME, performance oriented code isn't really limited by I$ miss rate. I$
misses are a bigger issue with 4K or 8K I$, but much less of an issue
with 32K I$.

Well, and also XG2 is currently managing to be smaller than RV64GC as
well, as fewer instructions is saving more than "common instructions
using less space" (like, 'C' saves 35%, but avoiding most of the cases
that need multi-instruction sequences saves 60%, ...).

Jumbo prefixes and similar help, but would still need to shave off
another 20% here.

...
