After realizing that Mitch Alsup was right in that there was no real
benefit in speeding up instruction decode in the manner I was trying to achieve with the use of block headers, I had tried, by going from banks of 32 registers to banks of 16 registers, to move to variable-length instructions.
For some reason, though, I couldn't make it work. It seemed like it
should, but I couldn't get the 16-bit instructions to fit.
Well, I've made another attempt. And it seems like going to banks of 16 registers is indeed sufficient (retaining, from Concertina II, the
artifice of only using seven registers as base registers and another seven as index registers) to fit an instruction set as complete as the one I'm aiming for in the available opcode space.
Of course, this does give up VLIW functionality. But while VLIW may not be
a true failure, where it works is in small-scale embedded processors. So
I'm not going to worry about attempting to use VLIW as a more conventional alternative to Ivan Godard's more radical Mill design.
With sequential decode, I suppose I could site immediate values after the instruction proper, but I've found that I do not have to do that, I can
have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.
Concertina III is described at
http://www.quadibloc.com/arch/cy01int.htm
John Savard
quadi <quadibloc@ca.invalid> posted:
After realizing that Mitch Alsup was right in that there was no real
benefit in speeding up instruction decode in the manner I was trying to
achieve with the use of block headers, I had tried, by going from banks of
32 registers to banks of 16 registers, to move to variable-length
instructions.
See Gould S.E.L 32/87 (or /67) for ideas to save a few bits here and there along the lines of base registers and register segmentation.
For some reason, though, I couldn't make it work. It seemed like it
should, but I couldn't get the 16-bit instructions to fit.
Well, I've made another attempt. And it seems like going to banks of 16
registers is indeed sufficient (retaining, from Concertina II, the
artifice of only using seven registers as base registers and another seven
as index registers) to fit an instruction set as complete as the one I'm
aiming for in the available opcode space.
Of course, this does give up VLIW functionality. But while VLIW may not be
a true failure, where it works is in small-scale embedded processors. So
I'm not going to worry about attempting to use VLIW as a more conventional
alternative to Ivan Godard's more radical Mill design.
Is there any "real" or even "useful" advantage of VLIW ??? Given the number of attempts and no real long-lasting results, history should be your guide.
With sequential decode, I suppose I could site immediate values after the
instruction proper, but I've found that I do not have to do that, I can
have them within the instruction body as normally indicated by its leading
bits - with only one bit of awkwardness for the 64-bit immediates.
In K9, we used a packet cache of 8 instructions per fetch, and used a
scheme called "vertical neighbor" to hold non-8-bit immediates.
In Mc88120 we just executed the SETHI and OP instructions to paste bits together.
The experience of both led me to My 66000 that simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is
6 gates and 2 gates of delay. This has worked out so well that I
encourage others to follow suit (or outright copy...)
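A minimal sketch of why appended constants keep sequential decode simple: the first word alone says how many constant words follow, so the decoder never has to look inside the constant. The 2-bit size field in the low bits and the least-significant-word-first order are assumptions made purely for illustration; this is not My 66000's actual encoding.

```python
def decode_stream(words):
    """Split a list of 32-bit words into (instruction, constant) pairs.

    Hypothetical format: the low 2 bits of each instruction word give the
    number of 32-bit constant words appended to it (0, 1 or 2).
    """
    out = []
    i = 0
    while i < len(words):
        inst = words[i]
        n_const = inst & 0b11              # assumed size field
        if n_const == 3:
            raise ValueError("reserved constant-size code")
        if i + 1 + n_const > len(words):
            raise ValueError("stream ends inside a constant")
        const = None
        if n_const:
            const = 0
            # assumed order: first appended word is the least significant
            for k in range(n_const):
                const |= words[i + 1 + k] << (32 * k)
        out.append((inst, const))
        i += 1 + n_const                   # one constant per instruction
    return out


stream = [0x10, 0x21, 0xDEADBEEF, 0x32, 0x1, 0x2]
for inst, const in decode_stream(stream):
    print(hex(inst), None if const is None else hex(const))
```

The point of the sketch is only that the instruction length is known from the leading word, which is what makes the length decode cheap.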
Concertina III is described at
http://www.quadibloc.com/arch/cy01int.htm
John Savard
On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:
The experience of both led me to My 66000 that simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is
6 gates and 2 gates of delay. This has worked out so well that I
encourage others to follow suit (or outright copy...)
That is an approach which does have an important advantage. Right now, I
only have immediates for the basic integer and floating-point
operations.
What about decimal floating-point immediates, for example? Appending
them to the instruction can be simple and orthogonal.
On Sat, 16 May 2026 03:57:06 +0000, quadi wrote:
Could 14 bits be useful where 13 bits are doomed to fail,
Actually, though, I had worked out two ways where 16 bit short
instructions that all must start with 111 could perhaps do useful work.
The first one was:
111 + (seven bit opcode) + (3) + (3)
On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:
Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.
As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.
But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've
noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.
In a way, Concertina II is VLIW "perfected" - by putting the bits that indicate parallelism in a header at the start of the block, the price of indicating parallelism isn't a shorter instruction word, and hence having
to make do with fewer registers, or shorter displacement fields, all
things that do have an obvious negative impact on performance.
And by going from the block-oriented Concertina II design to the variable-length instruction Concertina III design, I've gone from banks of 32 registers to banks of 16 registers!
Did I have to do this?
In Concertina III, instructions longer than 32 bits take up 1/16 of the opcode space. Adding a bit so as to use 32 registers instead of 16 would change that to 1/8.
In Concertina II, the 32-bit instructions take up about 3/4 of the opcode space.
So an ISA without block structure, with variable-length instructions
instead, with banks of 32 registers is possible! However, only 1/8 of the opcode space would be left for short instructions, and 16-bit instructions with only 13 bits available... would be largely useless. If having the
option of using 16-bit instructions is the primary benefit of having variable-length instructions, instead of every instruction being 32 bits long... then attempting to obtain the best of Concertina II and III in a single design through this artifice... which seems so very tempting... is
a mistake.
Of course, the 360 managed to get by quite well with only 1/4 of the opcode space used by 16-bit instructions. Could 14 bits be useful where 13 bits
are doomed to fail, and if so, what contrivance could I possibly use to squeeze out that extra opcode space... since I've tried, and abandoned as fatally flawed, a _lot_ of contrivances to squeeze out space in just that
way in the development of Concertina II?
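The 13-versus-14-bit arithmetic above can be checked with a short sketch. It assumes a format's share of the opcode space is a power-of-two fraction selected by a fixed prefix; `usable_bits` is a hypothetical helper for illustration, not part of any Concertina tooling.

```python
from fractions import Fraction


def usable_bits(width, share):
    """Bits left in a `width`-bit instruction whose format owns `share`
    of the opcode space, where share must be 1/2**k (a k-bit prefix)."""
    share = Fraction(share)
    d = share.denominator
    if share.numerator != 1 or d & (d - 1):
        raise ValueError("share must be 1/2**k")
    prefix_bits = d.bit_length() - 1       # k such that 2**k == d
    return width - prefix_bits


print(usable_bits(16, Fraction(1, 8)))     # 1/8 of the space: 13 bits
print(usable_bits(16, Fraction(1, 4)))     # 1/4 of the space, as on the 360: 14 bits
```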
Block structure had the advantage of letting me pack more bits in instructions. That it let me offer VLIW, in the sense of controlling
parallel execution, as an option... was just gravy.
John Savard
According to quadi <quadibloc@ca.invalid>:
On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:
Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.
As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.
It works great in programs where the compiler can predict the sequence of memory
references at compile time, much less well when the sequence is data dependent.
I can believe that video processing falls into the first category.
On 5/15/2026 10:57 PM, quadi wrote:
On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:
Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.
As I've noted, "the guy who invented VLIW" claimed, in a YouTube video, that a VLIW processor is highly successful as an embedded video processor. Of course, though, he is hardly a disinterested source.
But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
In-Order vs Out-of-Order:
In-Order:
+ Simpler hardware
- Not as fast
OoO:
- Complex hardware (reorder buffer, scoreboard/renamer, ...)
+ Faster
Both VLIW and In-Order benefit from a large register file.
OoO mostly benefits ISA designs that would otherwise be slow.
Mostly absorbing the cost of a lot of the ISA level inefficiencies.
Theoretically, OoO can better absorb cache misses, however my own
testing implies that the delta vs "cache miss results in pipeline stall"
vs "delay instruction to hide miss" appears to be mostly negligible.
Also raw CPU speed doesn't matter as much when the computation is
primarily limited by RAM bandwidth or latency (seems to be a pretty
common scenario IME).
In my case, I had realized that In-Order could be handled nearly exactly
the same as my prior LIW handling (no real changes needed to the
pipeline, etc), with the primary change that the I$ can have logic to
detect which instructions can run in parallel during cache line fetch,
and doing this is in-effect cheap enough to be worthwhile (the in-order
not adding any significant resource cost over LIW).
So, in my case, 16 byte cache lines, in Op0..Op3:
Can Op0 co-execute with Op1?
Can Op1 co-execute with Op2?
Can Op2 co-execute with Op3?
Can Op0/1/2 co-execute?
Can Op1/2/3 co-execute?
----------
So, say, superscalar logic was a lookup over opcode bits for flags like:
can this op run in Lane 2?
Can this op run in Lane 3?...
Can this op run with another op in Lane 2?
Can this op run with another op in Lane 3?
Does this op use Rd as a source?
Does this op use Rt as a source?
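A rough sketch of the kind of co-execute check described above, computed over the ops of one fetched line. The `Op` record, its field names and the three-lane limit are assumptions for illustration; the `srcs` tuple stands in for whatever the "uses Rd/Rt as a source" flags would produce, and the real lookup over opcode bits is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_LANES = 3  # assumed issue width (lanes 1..3)


@dataclass(frozen=True)
class Op:
    dest: Optional[int]          # destination register, None if no result
    srcs: Tuple[int, ...]        # registers read by this op
    lane2_ok: bool = True        # hypothetical flag: may issue in lane 2
    lane3_ok: bool = True        # hypothetical flag: may issue in lane 3


def can_pair(first, second):
    """second may issue beside first if it neither reads nor overwrites
    the register that first writes."""
    if first.dest is None:
        return True
    return first.dest not in second.srcs and first.dest != second.dest


def bundle_width(ops, start):
    """Number of consecutive ops from `start` that can co-issue."""
    lane_flags = ("lane2_ok", "lane3_ok")
    width = 1
    while width < MAX_LANES and start + width < len(ops):
        cand = ops[start + width]
        if not getattr(cand, lane_flags[width - 1]):
            break
        if not all(can_pair(p, cand) for p in ops[start:start + width]):
            break
        width += 1
    return width


line = [Op(1, (2, 3)), Op(4, (5, 6)), Op(7, (1, 4)), Op(8, (9,))]
print(bundle_width(line, 0), bundle_width(line, 2))
```

Since the answer depends only on the instruction bits, it could be computed once at cache line fill and stored alongside the line, which is the cheapness argument made above.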
On Sat, 16 May 2026 04:13:01 +0000, quadi wrote:
The first one was:
111 + (seven bit opcode) + (3) + (3)
I have finally realized that there is a way to turn the impossible goal
that seemed so tantalizingly close to achievement into something
possible.
Just add
11111 +
(break bit) +
(seven-bit opcode) +
(condition code bit) +
(five-bit destination register) +
(five-bit source register)
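As a sanity check on the field widths, here is a small sketch that packs both layouts. It assumes the fields are laid out most significant first in the order written, and that the two 3-bit fields of the 16-bit form are register numbers; `pack`, `short16` and `short24` are hypothetical names, not taken from the Concertina III description.

```python
def pack(fields):
    """fields: list of (value, width), most significant field first.
    Returns (word, total_bits)."""
    word = 0
    total = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
        total += width
    return word, total


def short16(opcode, dst, src):
    # 111 + (seven bit opcode) + (3) + (3)
    return pack([(0b111, 3), (opcode, 7), (dst, 3), (src, 3)])


def short24(brk, opcode, cc, dst, src):
    # 11111 + break bit + seven-bit opcode + condition code bit + 5 + 5
    return pack([(0b11111, 5), (brk, 1), (opcode, 7), (cc, 1),
                 (dst, 5), (src, 5)])


print(short16(0, 0, 0))          # prefix only, 16 bits in total
print(short24(0, 0, 0, 0, 0))    # prefix only, 24 bits in total
```

The totals come out to 16 and 24 bits, so the second layout is the 24-bit short form with full five-bit register fields.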
BGB <cr88192@gmail.com> posted:
On 5/15/2026 10:57 PM, quadi wrote:
On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:
Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.
As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.
But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of
argument that Ivan Godard made for his innovative Mill design. You've
noted, though, that unlike register hazards, cache misses, which are
unpredictable by compilers, can be handled by a simpler form of OoO, the
scoreboard of the 6600.
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
In-Order vs Out-of-Order:
In-Order:
+ Simpler hardware
- Not as fast
OoO:
- Complex hardware (reorder buffer, scoreboard/renamer, ...)
+ Faster
S/Faster/Higher Performing/
Both VLIW and In-Order benefit from a large register file.
OoO mostly benefits ISA designs that would otherwise be slow.
Mostly absorbing the cost of a lot of the ISA level inefficiencies.
Theoretically, OoO can better absorb cache misses, however my own
testing implies that the delta vs "cache miss results in pipeline stall"
vs "delay instruction to hide miss" appears to be mostly negligible.
You do not have an execution pipeline with depth > L1 cache miss
latency. When you do, new effects become feasible--like beginning
the second next loop iteration before the first one has completed.
This is where you can now absorb the L1 cache miss latency.
Also raw CPU speed doesn't matter as much when the computation is
primarily limited by RAM bandwidth or latency (seems to be a pretty
common scenario IME).
In my case, I had realized that In-Order could be handled nearly exactly
the same as my prior LIW handling (no real changes needed to the
pipeline, etc), with the primary change that the I$ can have logic to
detect which instructions can run in parallel during cache line fetch,
When you do not have condition codes, and only 1 register file, you
can determine parallel-ness by simply looking at the registers.
and doing this is in-effect cheap enough to be worthwhile (the in-order
not adding any significant resource cost over LIW).
When Rd-1 ~= either{SRC1-2, or SRC2-2}
So, in my case, 16 byte cache lines, in Op0..Op3:
Can Op0 co-execute with Op1?
Can Op1 co-execute with Op2?
When Rd-1 ~= either{SRC1-3, or SRC2-3}
Can Op2 co-execute with Op3?
When Rd-2 ~= either{SRC1-3, or SRC2-3}
Can Op0/1/2 co-execute?
Depends on what is in Lane 2
Can Op1/2/3 co-execute?
----------
So, say, superscalar logic was a lookup over opcode bits for flags like:
can this op run in Lane 2?
Can this op run in Lane 3?...
Can this op run with another op in Lane 2?
Can this op run with another op in Lane 3?
Does this op use Rd as a source?
Does this op use Rt as a source?
Given nomenclature like Mc88120 where {
Lanes = {MEM0, MEM1, MEM2, FADD, FMUL, Branch}
And MEM has an integer unit, and a shift unit
FADD has an integer unit
FMUL has an integer unit
Branch has an integer unit }
And each unit is buffered with its own reservation station;
You just let the RSs create a solution.
Given nomenclature like M5 with >10 FUs, the calculation is harder,
but you still just let the RSs create the solution.
--------------
On Sun, 17 May 2026 15:24:20 +0000, quadi wrote:
So I need additional opcode space for 48-bit instructions.
I managed to find enough space for the 48-bit instructions without taking any from elsewhere.
However, I'm now encountering a problem with the 32-bit instructions.
Given how I'm handling other sizes of immediates, I want all 32 registers
to be possible destinations for the 16-bit immediates.
This leads to an opcode space shortage for 32-bit operate instructions. There was a little slack in the existing 32-bit instructions that I could squeeze, but not enough.
The amount needed, though, is 1/3 the size of what the 16-bit short instructions take, or the same as what the 24-bit short instructions take.
Possible easy and obvious alternatives:
1) Drop the 24-bit short instructions, they're a weird length.
2) Go to 6-bit opcodes for the 16-bit short instructions, limiting them to the most important data types.
3) Stick with only 8 (or even only 16) registers as the destination of a 16-bit immediate.
Maybe I can squeeze more and avoid having to do any of them; if I must choose, (2) sounds like the most attractive, as a short instruction that
can only work on the first 8 registers is disfavored anyways.
John Savard
On 5/17/2026 1:37 PM, BGB wrote:
snip
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
Another disadvantage is less efficient memory utilization due to taking
up space for the template bits. It also causes a correspondingly less efficient memory bandwidth usage. This is particularly apparent in
EPIC, as they only get three instructions in 128 bits versus four in a traditional RISC (Although you could argue the longer instructions do
more, but this isn't proven.).
On 5/18/2026 12:57 PM, Stephen Fuld wrote:
On 5/17/2026 1:37 PM, BGB wrote:
snip
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
Another disadvantage is less efficient memory utilization due to
taking up space for the template bits. It also causes a
correspondingly less efficient memory bandwidth usage. This is
particularly apparent in EPIC, as they only get three instructions in
128 bits versus four in a traditional RISC (Although you could argue
the longer instructions do more, but this isn't proven.).
That is not the only way of encoding it.
I ended up with a 2-bit pattern in each 32-bit instruction:
00: ?T (Conditional)
01: ?F (Conditional)
10: Scalar / Final
11: WEX (Non-Final)
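A small sketch of how a decoder might use this 2-bit pattern to group instructions. Two things are assumptions made for illustration: that the pattern sits in the top two bits of the word, and that a conditional instruction ends a group the same way a Scalar/Final one does.

```python
TAGS = {0b00: "?T", 0b01: "?F", 0b10: "scalar/final", 0b11: "WEX"}


def tag_of(word):
    """Name of the 2-bit pattern, assumed to live in bits 31:30."""
    return TAGS[(word >> 30) & 0b11]


def bundles(words):
    """Group 32-bit words: a WEX (non-final) word is joined to the
    word that follows it; any other tag closes the current group."""
    groups, current = [], []
    for w in words:
        current.append(w)
        if tag_of(w) != "WEX":
            groups.append(current)
            current = []
    if current:
        groups.append(current)   # trailing non-final words, no terminator
    return groups


code = [0xC0000001, 0xC0000002, 0x80000003, 0x80000004, 0x00000005]
print([[hex(w) for w in g] for g in bundles(code)])
```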
On 5/18/2026 10:40 PM, BGB wrote:
On 5/18/2026 12:57 PM, Stephen Fuld wrote:
On 5/17/2026 1:37 PM, BGB wrote:
snip
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
Another disadvantage is less efficient memory utilization due to
taking up space for the template bits. It also causes a
correspondingly less efficient memory bandwidth usage. This is
particularly apparent in EPIC, as they only get three instructions in
128 bits versus four in a traditional RISC (Although you could argue
the longer instructions do more, but this isn't proven.).
That is not the only way of encoding it.
Of course. I was merely citing a particularly bad example.
I ended up with a 2-bit pattern in each 32-bit instruction:
00: ?T (Conditional)
01: ?F (Conditional)
10: Scalar / Final
11: WEX (Non-Final)
So you have reduced the loss of memory efficiency to about 6% (2/32).
Certainly smaller, but still there. BTW, I don't understand the conditionals. On what basis do they make their decision? Are they part
of some predication scheme?
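For reference, the encoding overheads under discussion work out as follows. The sketch takes the 2-bit pattern per 32-bit word from the post above, and for EPIC uses the IA-64 bundle layout of three 41-bit slots plus a 5-bit template in 128 bits; the helper names are hypothetical.

```python
from fractions import Fraction


def tag_overhead(tag_bits, container_bits):
    """Fraction of the container spent on grouping/template bits."""
    return Fraction(tag_bits, container_bits)


def bits_per_instruction(container_bits, n_instructions):
    """Average storage cost of one instruction."""
    return Fraction(container_bits, n_instructions)


print(float(tag_overhead(2, 32)))            # 2-bit pattern in a 32-bit word
print(float(tag_overhead(5, 128)))           # IA-64 template bits per bundle
print(float(bits_per_instruction(128, 3)))   # IA-64 bits per instruction
print(float(bits_per_instruction(128, 4)))   # 32-bit RISC, four per 128 bits
```

The template bits themselves are a small part of the IA-64 cost; most of the density loss comes from the 41-bit slots.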
On 5/19/2026 1:22 PM, Stephen Fuld wrote:
On 5/18/2026 10:40 PM, BGB wrote:
On 5/18/2026 12:57 PM, Stephen Fuld wrote:
On 5/17/2026 1:37 PM, BGB wrote:
snip
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
Another disadvantage is less efficient memory utilization due to
taking up space for the template bits. It also causes a
correspondingly less efficient memory bandwidth usage. This is
particularly apparent in EPIC, as they only get three instructions in
128 bits versus four in a traditional RISC (Although you could argue
the longer instructions do more, but this isn't proven.).
That is not the only way of encoding it.
Of course. I was merely citing a particularly bad example.
I ended up with a 2-bit pattern in each 32-bit instruction:
00: ?T (Conditional)
01: ?F (Conditional)
10: Scalar / Final
11: WEX (Non-Final)
So you have reduced the loss of memory efficiency to about 6% (2/32). Certainly smaller, but still there. BTW, I don't understand the conditionals. On what basis do they make their decision? Are they part of some predication scheme?
It cost 2 bits from a 32-bit instruction, granted.
The conditionals depended on the status of a single status bit:
?T: Execute if SR.T is Set, Typically No-Op if Clear (*1)
?F: Execute if SR.T is Clear, Typically No-Op if Set
Explicit predication also allows the compiler to reshuffle the then/else branches together to better fit into the pipeline (though, generally the
non-executed instructions still affect the pipeline flow as-if they were executed; so things like register RAW dependencies and similar still
apply even if only the "actually executed" instructions will visibly manifest results at the end). So, they are like ghost instructions which still function as-if they were real instructions, but outputs are
suppressed along with any other side-effects.
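A toy model of the ?T/?F behavior described above, assuming a single status bit and a one-instruction-per-slot pipeline. The tuple layout and the `run` function are hypothetical; the model only shows that a predicated-off instruction still consumes a slot while its result is suppressed.

```python
def run(program, regs, sr_t):
    """Execute a list of (pred, dest, fn, srcs) tuples.

    pred is None (unconditional), 'T' (execute if SR.T is set) or
    'F' (execute if SR.T is clear). Returns the number of pipeline
    slots consumed, which counts suppressed instructions as well.
    """
    slots = 0
    for pred, dest, fn, srcs in program:
        slots += 1                        # ghost or not, the slot is used
        enabled = pred is None or (pred == "T") == sr_t
        if enabled:
            regs[dest] = fn(*(regs[s] for s in srcs))
        # when not enabled, the write and any side effects are suppressed
    return slots


# if (SR.T) r2 = r0 + r1; else r2 = r0 - r1;
program = [
    ("T", 2, lambda a, b: a + b, (0, 1)),
    ("F", 2, lambda a, b: a - b, (0, 1)),
]
regs = {0: 5, 1: 7, 2: 0}
print(run(program, regs, True), regs[2])
```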
While arguably it might be better if the non-executed instructions could disappear as-if they never existed in the first place (not consuming
clock cycles or pipeline lanes), this isn't really possible with the existing pipeline.
But, either way, predicating a short if/else branch is faster on average than having a conditional branch to skip over it (and by the time the
limits of the predication scheme become more of an issue, it is
typically already time to switch over to using a conditional branch).
BGB <cr88192@gmail.com> posted:
On 5/19/2026 1:22 PM, Stephen Fuld wrote:
On 5/18/2026 10:40 PM, BGB wrote:
On 5/18/2026 12:57 PM, Stephen Fuld wrote:
On 5/17/2026 1:37 PM, BGB wrote:
snip
It is more VLIW vs In-Order, and In-Order vs OoO.
VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.
Another disadvantage is less efficient memory utilization due to
taking up space for the template bits. It also causes a
correspondingly less efficient memory bandwidth usage. This is
particularly apparent in EPIC, as they only get three instructions in
128 bits versus four in a traditional RISC (Although you could argue
the longer instructions do more, but this isn't proven.).
That is not the only way of encoding it.
Of course. I was merely citing a particularly bad example.
I ended up with a 2-bit pattern in each 32-bit instruction:
00: ?T (Conditional)
01: ?F (Conditional)
10: Scalar / Final
11: WEX (Non-Final)
So you have reduced the loss of memory efficiency to about 6% (2/32).
Certainly smaller, but still there. BTW, I don't understand the
conditionals. On what basis do they make their decision? Are they part
of some predication scheme?
It cost 2 bits from a 32-bit instruction, granted.
If you still have Scalar and WEX it only cost 1-bit.
The conditionals depended on the status of a single status bit:
?T: Execute if SR.T is Set, Typically No-Op if Clear (*1)
?F: Execute if SR.T is Clear, Typically No-Op if Set
It also costs the flags as bits.
I should note: Predication in My 66000 costs the predicated instruction 0-bits.
------------------
Explicit predication also allows the compiler to reshuffle the then/else
branches together to better fit into the pipeline (though, generally the
We taught the LLVM compiler to always place the then-clause first and the else-clause second without finding a compiler problem.
The HW does not switch back and forth between clauses, making register tracking in GBOoO fairly easy.
non-executed instructions still affect the pipeline flow as-if they were
executed; so things like register RAW dependencies and similar still
I found this unnecessary
apply even if only the "actually executed" instructions will visibly
manifest results at the end). So, they are like ghost instructions which
still function as-if they were real instructions, but outputs are
suppressed along with any other side-effects.
While arguably it might be better if the non-executed instructions could
disappear as-if they never existed in the first place (not consuming
clock cycles or pipeline lanes), this isn't really possible with the
existing pipeline.
Your problem, yet you see it as an advantage ?!?!
But, either way, predicating a short if/else branch is faster on average
than having a conditional branch to skip over it (and by the time the
limits of the predication scheme become more of an issue, it is
typically already time to switch over to using a conditional branch).
Predication works when the FETCH unit gets to the join point before
DECODE gets to the else-clause. When fetching 128-bits/cycle in a 1
wide machine, this is at least 8 instructions (each clause). Wider
machines will have correspondingly wider FETCH.
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making register tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to reshuffle instructions to pack into the pipeline more efficiently.
non-executed instructions still affect the pipeline flow as-if they were
executed; so things like register RAW dependencies and similar still
I found this unnecessary
Well, in my case it is a minor drawback:
One doesn't know whether the instruction is EX or No-EX until the EX
stages, but by this point the pipeline flow is essentially already
locked-in (to some extent, it is already locked in by the IF stage, but
IF isn't going to know yet whether or not these instructions will
actually run).
Your problem, yet you see it as an advantage ?!?!
Well, not this part...
This part is an annoyance, but no obvious way to make everything behave
as if it costs 0 cycles (or, less cycles than the "normal case").
BGB <cr88192@gmail.com> posted:
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making register
tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to reshuffle
instructions to pack into the pipeline more efficiently.
We do not do instruction scheduling, we let the instruction queues do
that kind of stuff. You must be caught in 1988 ...
--------------------
non-executed instructions still affect the pipeline flow as-if they were
executed; so things like register RAW dependencies and similar still
I found this unnecessary
Well, in my case it is a minor drawback:
One doesn't know whether the instruction is EX or No-EX until the EX
stages, but by this point the pipeline flow is essentially already
locked-in (to some extent, it is already locked in by the IF stage, but
IF isn't going to know yet whether or not these instructions will
actually run).
It's a simple problem for Reservation Stations to handle.
--------------
Your problem, yet you see it as an advantage ?!?!
Well, not this part...
This part is an annoyance, but no obvious way to make everything behave
as if it costs 0 cycles (or, less cycles than the "normal case").
Key word "obvious" but then again you spell both load and store as MOV.
On 5/20/2026 5:51 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making register
tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to reshuffle
instructions to pack into the pipeline more efficiently.
We do not do instruction scheduling, we let the instruction queues do
that kind of stuff. You must be caught in 1988 ...
I am assuming in-order here.
Shuffling instructions in the compiler leads to higher performance than
just leaving them in whatever order they come out of the main codegen.
Then again, AFAIK OoO didn't really hit mainstream processors until the
late 1990s (eg, Pentium II and Pentium III).
BGB wrote:
On 5/20/2026 5:51 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making
register tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to
reshuffle instructions to pack into the pipeline more
efficiently.
We do not do instruction scheduling, we let the instruction queues
do that kind of stuff. You must be caught in 1988 ...
I am assuming in-order here.
Shuffling instructions in the compiler leads to higher performance
than just leaving them in whatever order they come out of the main
codegen.
Then again, AFAIK OoO didn't really hit mainstream processors until
the late 1990s (eg, Pentium II and Pentium III).
Please check your history!
The PentiumPro was the first ever mass-market OoO CPU, it arrived
around 1996.
PentiumI/III/MMX were just variations on the original Pentium which
introduced superscalar in the form of the u and v pipes which could
execute two instructions at once, IFF you aligned them properly and
selected a simple instruction for the v pipe.
Terje
When you do not have condition codes, and only 1 register file, you
can determine parallel-ness by simply looking at the registers.
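A minimal sketch of that register-only check, in Python rather than hardware. The `Instr` tuple and `can_issue_together` are hypothetical names made up for this illustration, not anything from My 66000, and the model deliberately ignores memory dependencies and structural limits:

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    """Hypothetical decoded instruction: one optional destination register."""
    dst: Optional[int]        # destination register number, or None (e.g. a store)
    srcs: Tuple[int, ...]     # source register numbers

def can_issue_together(first: Instr, second: Instr) -> bool:
    """True if `second` has no register hazard against the older `first`.

    Assumes no condition codes and a single register file, so register
    numbers are the only state the two instructions can share.
    """
    if first.dst is None:
        return True                       # nothing written, nothing to conflict with
    raw = first.dst in second.srcs        # second reads what first writes
    waw = second.dst == first.dst         # both write the same register
    return not (raw or waw)
```

For example, `can_issue_together(Instr(1, (2, 3)), Instr(4, (1, 6)))` is false because the second instruction reads register 1, which the first one writes.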
On 5/20/2026 5:51 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making register
tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to reshuffle
instructions to pack into the pipeline more efficiently.
We do not do instruction scheduling, we let the instruction queues do
that kind of stuff. You must be caught in 1988 ...
I am assuming in-order here.
Shuffling instructions in the compiler leads to higher performance than
just leaving them in whatever order they come out of the main codegen.
Then again, AFAIK OoO didn't really hit mainstream processors until the
late 1990s (eg, Pentium II and Pentium III).
As for me, in 1988 I would still have been very young. I think these
being mostly the "sit around and watch cartoons" years (with my "K12"
years mostly spanning the 1990s and early 2000s).
But, I am getting on in years, having existed for over 4 decades now...
--------------------
non-executed instructions still affect the pipeline flow as-if they were
executed; so things like register RAW dependencies and similar still
I found this unnecessary
Well, in my case it is a minor drawback:
One doesn't know whether the instruction is EX or No-EX until the EX
stages, but by this point the pipeline flow is essentially already
locked-in (to some extent, it is already locked in by the IF stage, but
IF isn't going to know yet whether or not these instructions will
actually run).
It's a simple problem for Reservation Stations to handle.
Once again, CPU is in-order.
--------------
Your problem, yet you see it as an advantage ?!?!
Well, not this part...
This part is an annoyance, but no obvious way to make everything behave
as if it costs 0 cycles (or, less cycles than the "normal case").
Key word "obvious" but then again you spell both load and store as MOV.
This is an assembler design choice...
But, then again:
MOV EAX, DWORD PTR [EDX] //Intel style
MOV.L 0(%EDX), %EAX //AT&T / GAS style
MOV.L 0(%A3), %D0 //M68K style
MOV.L @R3, R2 //SuperH style
MOV.L (R3), R2 //style I went with
...
LW X10, 0(X13) //RISC-V style
A lot of targets still use MOV...
The assembler in BGBCC also accepts RISC-V notation (and I almost
considered switching over), but I mostly ended up sticking with the
former style syntax due to inertia (and there is a "non-zero friction
cost" related to which ASM syntax one uses).
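The in-order predication cost discussed above (a No-EX instruction still holding up its RAW dependents) can be illustrated with a toy single-issue model. This is only a sketch under stated assumptions: the tuple layout, the `issue_cycles` name and the `squash_known_early` switch are all hypothetical, and the model is not a description of the actual pipeline.

```python
def issue_cycles(program, squash_known_early=False):
    """Return the issue cycle of each instruction in a toy in-order pipeline.

    program: list of (dst, srcs, latency, executes) tuples, where `executes`
    is False for a predicated-off (No-EX) instruction.
    squash_known_early=False models interlock logic that cannot yet know the
    predicate, so a No-EX instruction still reserves its destination.
    squash_known_early=True is the idealized case where it costs nothing extra.
    """
    ready = {}        # register -> cycle at which its value is available
    cycles = []
    next_slot = 0     # earliest cycle the next instruction may issue
    for dst, srcs, latency, executes in program:
        issue = max([next_slot] + [ready.get(r, 0) for r in srcs])
        cycles.append(issue)
        next_slot = issue + 1
        if dst is not None and (executes or not squash_known_early):
            ready[dst] = issue + latency
    return cycles

# A predicated-off 3-cycle load into r1, followed by an add that reads r1.
prog = [(1, (2,), 3, False), (3, (1,), 1, True)]
print(issue_cycles(prog))                            # [0, 3]
print(issue_cycles(prog, squash_known_early=True))   # [0, 1]
```

In the first case the add waits for a load result that is never written, which is the "annoyance" being described; the second case is what a zero-cost scheme would have to achieve.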
BGB <cr88192@gmail.com> posted:
On 5/20/2026 5:51 PM, MitchAlsup wrote:// style BGB inherited
BGB <cr88192@gmail.com> posted:
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making register
tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to reshuffle
instructions to pack into the pipeline more efficiently.
We do not do instruction scheduling, we let the instruction queues do
that kind of stuff. You must be caught in 1988 ...
I am assuming in-order here.
Shuffling instructions in the compiler leads to higher performance than
just leaving them in whatever order they come out of the main codegen.
Then again, AFAIK OoO didn't really hit mainstream processors until the
late 1990s (eg, Pentium II and Pentium III).
As for me, in 1988 I would still have been very young. I think these
being mostly the "sit around and watch cartoons" years (with my "K12"
years mostly spanning the 1990s and early 2000s).
But, I am getting on in years, having existed for over 4 decades now...
--------------------
non-executed instructions still affect the pipeline flow as-if they were
executed; so things like register RAW dependencies and similar still
I found this unnecessary
Well, in my case it is a minor drawback:
One doesn't know whether the instruction is EX or No-EX until the EX
stages, but by this point the pipeline flow is essentially already
locked-in (to some extent, it is already locked in by the IF stage, but
IF isn't going to know yet whether or not these instructions will
actually run).
It's a simple problem for Reservation Stations to handle.
Once again, CPU is in-order.
--------------
Your problem, yet you see it as an advantage ?!?!
Well, not this part...
This part is an annoyance, but no obvious way to make everything behave
as if it costs 0 cycles (or, less cycles than the "normal case").
Key word "obvious" but then again you spell both load and store as MOV.
This is an assembler design choice...
But, then again:
MOV EAX, DWORD PTR [EDX] //Intel style
MOV.L 0(%EDX), %EAX //AT&T / GAS style
MOV.L 0(%A3), %D0 //M68K style
MOV.L @R3, R2 //SuperH style
MOV.L (R3), R2 //style I went with
...
LW X10, 0(X13) //RISC-V style
A lot of targets still use MOV...
MOV implies that the data is unaltered, while LD implies the memory
value is expanded to fill out the whole register, and ST implies
the register value is chopped off to fit in the memory container.
Expansion means sign or zero extend, chop means bits are ignored.
The assembler in BGBCC also accepts RISC-V notation (and I almost
considered switching over), but I mostly ended up sticking with the
former style syntax due to inertia (and there is a "non-zero friction
cost" related to which ASM syntax one uses).
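The LD/ST distinction drawn above (a load expands the memory value to fill the register, a store chops the register value to fit the container) can be sketched as follows. The function names and the 64-bit default register width are assumptions for illustration only:

```python
def load(mem_value: int, bits: int, signed: bool, reg_bits: int = 64) -> int:
    """Expand a `bits`-wide memory value into a `reg_bits`-wide register image."""
    value = mem_value & ((1 << bits) - 1)
    if signed and (value >> (bits - 1)) & 1:
        value -= 1 << bits                  # sign-extend: replicate the top bit
    return value & ((1 << reg_bits) - 1)    # zero-extend otherwise

def store(reg_value: int, bits: int) -> int:
    """Chop a register value down to `bits`; the upper bits are simply ignored."""
    return reg_value & ((1 << bits) - 1)

print(hex(load(0x80, 8, signed=True)))    # 0xffffffffffffff80
print(hex(load(0x80, 8, signed=False)))   # 0x80
print(hex(store(0x123456789, 32)))        # 0x23456789
```

A register-to-register MOV, by contrast, would be the identity on the full register image, which is the sense in which MOV "implies that the data is unaltered".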
On Thu, 21 May 2026 16:05:10 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BGB wrote:
On 5/20/2026 5:51 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 5/20/2026 1:25 PM, MitchAlsup wrote:
----------------------------
The HW does not switch back and forth between clauses, making
register tracking in GBOoO fairly easy.
In my case, the predication scheme allows the compiler to
reshuffle instructions to pack into the pipeline more
efficiently.
We do not do instruction scheduling, we let the instruction queues
do that kind of stuff. You must be caught in 1988 ...
I am assuming in-order here.
Shuffling instructions in the compiler leads to higher performance
than just leaving them in whatever order they come out of the main
codegen.
Then again, AFAIK OoO didn't really hit mainstream processors until
the late 1990s (eg, Pentium II and Pentium III).
Please check your history!
The PentiumPro was the first ever mass-market OoO CPU, it arrived
around 1996.
PentiumI/III/MMX were just variation on the original Pentium which
introduced superscalar in the form of the u and v pipes which could
execute two instructions at once,
Pentium-MMX (P55C) is indeed a variant of Pentium, with not insignificant
microarchitectural changes (1 stage longer pipeline, slightly relaxed
pairing rules, different decoder that does not depend on instruction
boundary marks in the cache).
But Pentium III is P6, next generation of Pentium II with almost
identical core uArch.
IFF you aligned them properly and
selected a simple instruction for the v pipe.
P6 had its own problems, not with execution phase but with decoders.
Google for 4-1-1.
Either way, basic point still stands:
1996 is still much later than 1988...
Though, looking stuff up:
Seems PentiumPro was intended for the workstation market;
Its market share was (AFAIK) not as big as the Pentium II or III.
Whereas Pentium II was a more consumer marketed chip,
and widely sold.
Stuff I am reading says that PentiumPro, Pentium II, and Pentium III
were all based on the P6 architecture.
In contrast to the Pentium I and Pentium MMX, which were P5.
Or Pentium 4, which was NetBurst.
Then Intel later brought back a P6 variant for the Intel Core Micro-Architecture, after NetBurst was an epic fail.
BGB wrote:
Either way, basic point still stands:
1996 is still much later than 1988...
Though, looking stuff up:
Seems PentiumPro was intended for the workstation market;
Its market share was (AFAIK) not as big as the Pentium II or III.
Whereas Pentium II was a more consumer marketed chip,
and widely sold.
Stuff I am reading says that PentiumPro, Pentium II, and Pentium III were all based on the P6 architecture.
You were right, Michael S already showed me that I was wrong re PentiumII/III.
In contrast to the Pentium I and Pentium MMX, which were P5.
Or Pentium 4, which was NetBurst.
Then Intel later brought back a P6 variant for the Intel Core Micro-Architecture, after NetBurst was an epic fail.
The P4 was an amazing design, running the core at 2 X the official clock speed but only doing half an operation per half-cycle.
It ran like the proverbial bat out of hell when everything aligned properly, but slammed into a wall every time it had to leave the fast inner core and switch to normal processing, i.e. for stuff like SHR.
Afair, it also blew up integer MUL at the same time as shifts became much slower, which meant that many previously cheap addressing operations now became much slower.
Terje
The P4 was an amazing design, running the core at 2 X the official clock
speed but only doing half an operation per half-cycle.
It ran like the proverbial bat out of hell when everything aligned
properly, but slammed into a wall every time it had to leave the fast
inner core and switch to normal processing, i.e. for stuff like SHR.
Afair, it also blew up integer MUL at the same time as shifts became
much slower, which meant that many previously cheap addressing
operations now became much slower.
Replay: Unknown Features of the NetBurst Core
https://web.archive.org/web/20160306140603/http://www.xbitlabs.com/articles/cpu/print/replay.html
BGB wrote:
Then again, AFAIK OoO didn't really hit mainstream processors until the
late 1990s (eg, Pentium II and Pentium III).
Please check your history!
The PentiumPro was the first ever mass-market OoO CPU, it arrived around 1996.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Replay: Unknown Features of the NetBurst Core
https://web.archive.org/web/20160306140603/http://www.xbitlabs.com/articles/cpu/print/replay.html
Yes, xbitlabs was a good site. Apparently there is not enough
interest in these topics to support such sites when they try to do it professionally.
Anyway, yes, that's a detailed description of replays, and now I
remember that I have already seen this (certainly I remember seeing
something about RL-7 and RL-12). Unfortunately, in a number of the
more involved places it is not clear enough (or maybe I invested not
enough time into understanding it), so I only took away a general
impression.
In any case, one thing I wonder about is the microbenchmark that
results in Pic. 1: With just dependent loads (and no adds), it already
has 19 cycles of latency instead of the 9 cycles of latency to L2.
How come?
Ok, the second load takes one round through the replay
loop, and in a long line of loads, several loads will travel through
RL-7 at all times. And I guess, some will cause a delay of a load
entering the dispatch, but that should add one cycle of latency now
and then. How do we get 19? The increase to 32 cycles if one add is involved also seems extreme. Somehow I still have not figured out
some important part of these replay loops.
Then again, with 9 cycles of L2 latency and 40 0.5-cycle adds in one dependence chain, I would expect a minimum latency of 27 cycles, but
Pic.1 shows <22 cycles. How come?
- anton