ARM64 (ARMv8) architecturally supports 4k, 16k and 64k.
These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.
John Levine <johnl@taugh.com> writes:
These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.
Two block devices bought less than a year ago:
Disk model: KINGSTON SEDC2000BM8960G
Disk model: WD Blue SN580 2TB
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Disk model: WD Blue SN580 2TB
I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.
On 8/15/2025 11:19 AM, BGB wrote:
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
I can note that in some of my own testing, I tried various page sizes,
and seemingly found a local optimum at around 16K.
I think that is consistent with what some others have found. I suspect
the average page size should grow as memory gets cheaper, which leads to more memory on average in systems. This also leads to larger programs,
as they can "fit" in larger memory with less paging. And as disk
(spinning or SSD) get faster transfer rates, the cost (in time) of
paging a larger page goes down. While 4K was the sweet spot some
decades ago, I think it has increased, probably to 16K. At some point
in the future, it may get to 64K, but not for some years yet.
Going from 4K or 8K to 16K saw a reduction in TLB miss rates,
but 16K to 32K or 64K did not see any significant reduction; it did,
however, see a more significant increase in memory footprint due to
allocation overheads (whereas, OTOH, going from 4K to 16K pages does
not see much increase in memory footprint).
Patterns seemed consistent across multiple programs tested, but harder
to say if this pattern would be universal.
I had noted, when running stats on where in the pages memory accesses land:
4K: Pages tend to be accessed fairly evenly
16K: Minor variation as to what parts of the page are being used.
64K: Significant variation between parts of the page.
Basically, tracking per-page memory accesses on a finer grain boundary
(eg, 512 bytes).
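To make that bookkeeping concrete, here is a minimal C sketch of the
kind of instrumentation described above (my reconstruction, not BGB's
actual code): an access hook counts hits per 512-byte granule of the
current page, so one can see how evenly a 4K/16K/64K page is used.

#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT    16   /* 64K pages in this example */
#define SUB_SHIFT      9   /* 512-byte granules */
#define SUBS_PER_PAGE (1 << (PAGE_SHIFT - SUB_SHIFT))

typedef struct {
    uintptr_t page;                 /* page number last seen */
    uint32_t  hits[SUBS_PER_PAGE];  /* per-granule access counts */
} page_stats;

/* called from the emulator/interpreter on every memory access */
static void note_access(page_stats *ps, uintptr_t addr)
{
    uintptr_t page = addr >> PAGE_SHIFT;
    if (ps->page != page) {         /* crude one-entry tracker */
        ps->page = page;
        memset(ps->hits, 0, sizeof ps->hits);
    }
    ps->hits[(addr >> SUB_SHIFT) & (SUBS_PER_PAGE - 1)]++;
}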
Interesting.
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at
all (and increasing page size only really sees benefit for TLB miss
rate so long as the whole page is "actually being used").
Not necessarily. Consider the case of a 16K (or larger) page with two
"hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K pages, but only one with larger pages.
John Levine <johnl@taugh.com> writes:
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Yes. But if the argument had any merit that 512B is a good page size
because it avoids having to transfer 8, 16, or 32 sectors at a time,
it would still have merit, because the interface still shows 512B
sectors.
EricP <ThatWouldBeTelling@thevillage.com> writes:
EricP wrote:
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The question isn't could one build a modern risc-style pipelined cpu
from TTL in 1975 - of course one could. Nor do I see any question of
could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.
I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
(using some in-order superscalar ideas and two reg file write ports
to "catch up" after pipeline bubbles).
TTL risc would also be much cheaper to design and prototype.
VAX took hundreds of people many many years.
The question is could one build this at a commercially competitive price?
There is a reason people did things sequentially in microcode.
All those control decisions that used to be stored as bits in microcode now
become real logic gates. And in SSI TTL you don't get many to the $.
And many of those sequential microcode states become independent concurrent
state machines, each with its own logic sequencer.
I am confused. You gave a possible answer in the posting you are
replying to.
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
- anton
It is maybe pushing it a little if one wants to use an AVL-tree or
B-Tree for virtual memory vs a page-table.
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time.
BGB <cr88192@gmail.com> writes:
It is maybe pushing it a little if one wants to use an AVL-tree or
B-Tree for virtual memory vs a page-table.
I assume that you mean a balanced search tree (binary (AVL) or n-ary
(B)) vs. the now-dominant hierarchical multi-level page tables, which
are tries.
In both a hardware and a software implementation, one could implement
a balanced search tree, but what would be the advantage?
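For concreteness, a minimal sketch of the walk that makes the trie
attractive (schematic code, x86-64-style 9-bit fields, tables assumed
identity-mapped): each level is indexed directly by a bit-field of the
virtual address, so a lookup is a fixed number of dependent loads with
no key comparisons, and an insert never rebalances anything. A balanced
search tree would pay compares at every node and rebalancing writes,
for no gain at these depths.

#include <stdint.h>

typedef uint64_t pte_t;
#define PTE_PRESENT 1ULL
#define PTE_ADDR(e) ((e) & ~0xfffULL)  /* next-level table, 4K-aligned */

/* returns the leaf PTE slot for va, or 0 on a hole (page fault) */
static pte_t *walk(pte_t *root, uint64_t va)
{
    pte_t *table = root;
    for (int shift = 39; shift > 12; shift -= 9) {
        pte_t e = table[(va >> shift) & 0x1ff];
        if (!(e & PTE_PRESENT))
            return 0;
        table = (pte_t *)(uintptr_t)PTE_ADDR(e);  /* descend one level */
    }
    return &table[(va >> 12) & 0x1ff];
}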
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
Is that possible with a PAL before it has been programmed?
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input inverter,
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
For a wide instruction or stage register I'd look at chips such as a 74LS377
with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.
So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an octal inverter
(a bit slower, but inverters should be fast).
I'm just showing why it was more than just an AND gate.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
Two layers of NAND :-)
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
Agreed, the logic has to go somewhere. Regularity in the
instruction set would have been even more important then than now,
to reduce the logic requirements for decoding.
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time.
Recent studies show that TLB-related precise interrupts occur
once every 100–1000 user instructions on all ranges of code, from
SPEC to databases and engineering workloads [5, 18]."
On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 04 Aug 2025 09:53:51 -0700
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
That is a language lawyer's type of reasoning. Normally gcc maintainers
are wiser than that because, well, by chance gcc happens to be a widely
used production compiler. I don't know why this time they had chosen
a less conservative road.
They invented an identifier which lands in the _[A-Z].* namespace
designated as reserved by the standard.
What would be an example of a more conservative way to name the
identifier?
EricP <ThatWouldBeTelling@thevillage.com> writes:
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
... if the buffers fill up and there are not enough resources left for
the TLB miss handler.
- anton
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
Is that possible with a PAL before it has been programmed?
They can speed and partially function test it.
It's programmed by blowing internal fuses, which is a one-shot thing,
so that function can't be tested.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input inverter,
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like
74LS375.
For a wide instruction or stage register I'd look at chips such as a
74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock
enable, vcc, gnd.
So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an octal inverter
(a bit slower, but inverters should be fast).
I'm just showing why it was more than just an AND gate.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
Two layers of NAND :-)
Thinking about different ways of doing this...
If the first NAND layer has open collector outputs then we can use
a wired-AND logic driving and invertor for the second NAND plane.
If the instruction buffer outputs to a set of 74159 4:16 demux with
open collector outputs, then we can just wire the outputs we want
together with a 10k pull-up resistor and drive an invertor,
to form the second output NAND layer.
inst buf <15:8> <7:0>
| | | |
4:16 4:16 4:16 4:16
vvvv vvvv vvvv vvvv
10k ---|---|---|---|------>INV->
10k ---------------------->INV->
10k ---------------------->INV->
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
Agreed, the logic has to go somewhere. Regularity in the
instruction set would have been even more important then than now,
to reduce the logic requirements for decoding.
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb drams were just making it to customers, 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions cover about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.
But a fixed 32-bit instruction is very much easier to fetch, and
decode needs a lot less logic for shifting prefetch buffers,
compared to, say, variable length 1 to 12 bytes.
I am unsure how GCC --pedantic deals with the standards-contrary
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition.
In article <107mf9l$u2si$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
It's not clear to me what the distinction of technical vs. business
is supposed to be in the context of ISA design. Could you explain?
I can attempt to, though I'm not sure if I can be successful.
And so with the VAX, I can imagine the work (which started in,
what, 1975?) being informed by a business landscape that saw an
increasing trend towards favoring high-level languages, but also
saw the continued development of large, bespoke, business
applications for another five or more years, and with customers
wanting to be able to write (say) complex formatting sequences
easily in assembler (the EDIT instruction!), in a way that was
compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
market (POLYF/POLYG!) who would be writing primarily in FORTRAN
but jumping to assembler for the fuzz-busting speed boost (so
stabilize what amounts to an ABI very early on!), and so forth.
Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
[...]
I am unsure how GCC --pedantic deals with the standards-contrary
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition.
I'm not sure what you're referring to. You didn't say what foo is.
I believe that in all versions of C, the result of a comma operator has
the type and value of its right operand, and the type of an unprefixed character constant is int.
Can you show a complete example where `sizeof (foo, 'C')` yields
sizeof (int) in any version of GNUC?
EricP <ThatWouldBeTelling@thevillage.com> writes:
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time.
I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
with hardware table walking, on a 1000x1000 matrix multiply with
pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
cost about 20 cycles.
- anton
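The access pattern in question presumably looked something like the
following (my guess at the shape of the benchmark, not Anton's actual
code): with 1000x1000 doubles a row is 8000 bytes, so in this loop
ordering a[i][k] and c[i][j] both move down a column, touching two
fresh ~4K pages on every inner iteration.

#define N 1000
static double a[N][N], b[N][N], c[N][N];

void matmul_pessimal(void)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                /* a[i][k] and c[i][j] stride N*8 = 8000 bytes per
                   step of i: pessimal spatial locality */
                c[i][j] += a[i][k] * b[k][j];
}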
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
All of these are attempts to fix inherent drawbacks and limitations
in the SW-miss approach, and all of them run counter to the only
advantage SW-miss had: its simplicity.
The SW approach is inherently synchronous and serial -
it can only handle one TLB miss at a time, one PTE read at a time.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
Each PTE read can cache miss and stall that walker.
As most OoO caches support multiple pending misses and hit-under-miss,
you can create as many HW walkers as you can afford.
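Concretely, the RMW in question can be a single atomic OR; a sketch in
C11 atomics, with schematic names and bit positions (not from any
particular architecture). With fetch-or, concurrent walkers setting the
A and M bits in the same PTE cannot lose each other's updates; the race
that remains is with software clearing the bits, which comes up below.

#include <stdatomic.h>
#include <stdint.h>

#define PTE_ACCESSED (1ULL << 5)   /* positions are illustrative */
#define PTE_MODIFIED (1ULL << 6)

static inline void pte_mark(_Atomic uint64_t *pte, int is_write)
{
    uint64_t bits = PTE_ACCESSED | (is_write ? PTE_MODIFIED : 0);
    atomic_fetch_or_explicit(pte, bits, memory_order_relaxed);
}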
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want
atomicity there.
On 18.08.2025 07:18, Keith Thompson wrote:
Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
[...]
I am unsure how GCC --pedantic deals with the standards-contrary
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition.
I'm not sure what you're referring to. You didn't say what foo is.
I believe that in all versions of C, the result of a comma operator
has the type and value of its right operand, and the type of an
unprefixed character constant is int.
Can you show a complete example where `sizeof (foo, 'C')` yields
sizeof (int) in any version of GNUC?
Presumably that's a typo - you meant to ask when the size is /not/ the
size of "int" ? After all, you said yourself that "(foo, 'C')"
evaluates to 'C' which is of type "int". It would be very interesting
if Jakob can show an example where gcc treats the expression as any
other type than "int".
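For what it's worth, a complete test program (mine, not from the
thread), modulo the presumable typo David notes: the comma operator
yields its right operand and an unprefixed character constant is int
in C, so this should print sizeof(int) (typically 4) under gcc in any
-std= mode; compiled as C++ it would print 1, since 'C' is char there.

#include <stdio.h>

int main(void)
{
    double foo = 0.0;
    printf("%zu\n", sizeof (foo, 'C'));  /* sizeof(int) in C */
    printf("%zu\n", sizeof (int));
    return 0;
}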
For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
Intel Lion Cove, I'd do the following modification to your inner loop
(back in Intel syntax):
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
adc edx,edx
add rax,[r9+rcx*8]
adc edx,0
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
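For readers following along, a plain-C reference (my sketch, not
anyone's posted code) for what the loop above computes: a three-summand
multi-word add, where the carry into the next limb can legitimately be
0, 1, or 2 - which is why the code increments edx rather than just
setting it.

#include <stdint.h>
#include <stddef.h>

void add3_ref(uint64_t *dst, const uint64_t *a, const uint64_t *b,
              const uint64_t *c, size_t n)
{
    unsigned carry = 0;                 /* 0..2 */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned cy = s < a[i];         /* carry out of a+b */
        uint64_t t = s + c[i];
        cy += t < s;                    /* carry out of +c */
        uint64_t r = t + carry;
        cy += r < t;                    /* carry out of +carry-in */
        dst[i] = r;
        carry = cy;                     /* never exceeds 2 */
    }
}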
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
Forgot to fix the "mov edx, ebx" here. One other thing: I think that
the "add rbx, rax" should be "add rax, rbx". You want to add the
carry to rax before storing the result. So the version with just one iteration would be:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
And the version with the two additional adc-using iterations would be
(with an additional correction):
mov edi,1
xor ebx,ebx
next:
mov rax,[rsi+rcx*8]
add [r8+rcx*8], rax
mov rax,[rsi+rcx*8+8]
adc [r8+rcx*8+8], rax
xor edx, edx
mov rax,[rsi+rcx*8+16]
adc rax,[r8+rcx*8+16]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8+16],rax
add rcx,3
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready
and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
all makes more sense with that. ebx then contains the carry from
the last cycle on entry. The carry dependency chain starts at
clearing edx, then gets to additional carries, then is copied to
ebx, transferred into the next iteration, and is ended there by
overwriting ebx. No dependency cycles (except the loop counter and
addresses, which can be dealt with by hardware or by unrolling), and
ebx contains the carry from the last iteration
One other problem is that according to Agner Fog's instruction
tables, even the latest and greatest CPUs from AMD and Intel that he
measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
per cycle, and adc has a latency of 1, so breaking the dependency
chain in a beneficial way should avoid the use of adc. For our
three-summand add, it's not clear if adcx and adox can run in the
same cycle, but looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
Forgot to fix the "mov edx, ebx" here. One other thing: I think that
the "add rbx, rax" should be "add rax, rbx". You want to add the
carry to rax before storing the result. So the version with just one iteration would be:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
And the version with the two additional adc-using iterations would be
(with an additional correction):
mov edi,1
xor ebx,ebx
next:
mov rax,[rsi+rcx*8]
add [r8+rcx*8], rax
mov rax,[rsi+rcx*8+8]
adc [r8+rcx*8+8], rax
xor edx, edx
mov rax,[rsi+rcx*8+16]
adc rax,[r8+rcx*8+16]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8+16],rax
add rcx,3
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
- anton
Anton, I like what you and Michael have done, but I'm still not sure
everything is OK:
In your code, I only see two input arrays [rsi] and [r8], instead of
three? (Including [r9])
It would also be possible to use SETC to save the intermediate carries...
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
John Levine <johnl@taugh.com> writes:
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Yes. But if the argument had any merit that 512B is a good page size
because it avoids having to transfer 8, 16, or 32 sectors at a time,
it would still have merit, because the interface still shows 512B
sectors.
I think we're agreeing that even in the early 1980s a 512 byte page was
too small. They certainly couldn't have made it any smaller, but they
should have made it larger.
S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
Note that 360 has optional page protection used only for access
control. In the 370 era they had a legacy of 2k or 4k pages, and
AFAICS IBM was mainly aiming at bigger machines, so they
were not so worried about fragmentation.
PDP-11 experience possibly contributed to using smaller pages for VAX.
Microprocessors were designed with different constraints, which
led to bigger pages. But VAX apparently could afford a reasonably
large TLB, and due to the VMS structure the gain was bigger than
for other OSes.
And a little correction: the VAX architecture handbook is dated 1977,
so actually the decision about page size had to be made in 1977
at the latest, and possibly earlier.
antispam@fricas.org (Waldek Hebisch) writes:
The basic question is if VAX could afford the pipeline.
VAX 11/780 only performed instruction fetching concurrently with the
rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
NVAX applied more pipelining, but CPI remained high.
VUPs MHz  CPI  Machine
   1  5   10   11/780
   4 12.5 6.25 8600
   6 22.2 7.4  8700
  35 90.9 5.1  NVAX+
SPEC92  MHz VAX CPI Machine
    1/1   5  10/10  VAX 11/780
133/200 200    3/2  Alpha 21064 (DEC 7000 model 610)
VUPs and SPEC numbers from
<https://pghardy.net/paul/programs/vms_cpus.html>.
The 10 CPI (cycles per instruction) of the VAX 11/780 are anecdotal.
The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
is probably somewhat off (due to the anecdotal base being off), but if
you relate them to each other, the offness cancels itself out.
Note that the NVAX+ was made in the same process as the 21064, the
21064 has about twice the clock rate, and has 4-6 times the performance,
resulting not just in a lower native CPI, but also in a lower "VAX
CPI" (the CPI a VAX would have needed to achieve the same performance
at this clock rate).
On Tue, 19 Aug 2025 05:47:01 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
are certainly capable of more than 1 adcx|adox per cycle.
Below are execution times of very heavily unrolled adcx/adox code with dependency broken by a trick similar to the above:
Platform         RC     GM     SK     Z3
add3_my_adx_u17  244.5  471.1  482.4  407.0
Considering that there are 2166 adcx/adox/adc instructions, we have the
following number of adcx/adox/adc instructions per clock:
Platform  RC    GM    SK    Z3
          1.67  1.10  1.05  1.44
For Gracemont and Skylake there exists a possibility of a small
measurement mistake, but Raptor Cove appears to be capable of at least 2
instructions of this type per clock, while Zen3 is capable of at least
1.5 but more likely also 2.
It looks to me that the bottleneck on both RC and Z3 is either the
rename phase or, more likely, L1$ access. It seems that while
Golden/Raptor Cove can occasionally issue 3 loads + 2 stores per clock,
it can not sustain more than 3 load-or-store accesses per clock.
Code:
.file "add3_my_adx_u17.s"
.text
.p2align 4
.globl add3
.def add3; .scl 2; .type 32; .endef
.seh_proc add3
add3:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
# %rcx - dst
# %rdx - a
# %r8 - b
# %r9 - c
sub %rdx, %rcx
mov %rcx, %r10 # r10 = dst - a
sub %rdx, %r8 # r8 = b - a
sub %rdx, %r9 # r9 = c - a
mov %rdx, %r11 # r11 - a
mov $60, %edx
xor %ecx, %ecx
.p2align 4
.loop:
xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
mov (%r11), %rsi
adcx (%r11,%r8), %rsi
adox (%r11,%r9), %rsi
mov 8(%r11), %rax
adcx 8(%r11,%r8), %rax
adox 8(%r11,%r9), %rax
mov %rax, 8(%r10,%r11)
Very impressive Michael!
I particularly like how you are interleaving ADOX and ADCX to gain
two carry bits without having to save them off to an additional
register.
Terje
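For reference, the same two-chain idea in compiler intrinsics (my
sketch; whether gcc or clang actually emit interleaved adcx/adox here
rather than plain adc with flag spills is up to the compiler): one
carry chain runs through the a+b adds, the other through the +c adds.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void add3_chains(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                 const uint64_t *c, size_t n)
{
    unsigned char cf = 0, of = 0;       /* two independent carries */
    for (size_t i = 0; i < n; i++) {
        unsigned long long t;
        cf = _addcarry_u64(cf, a[i], b[i], &t);  /* chain 1: a+b  */
        of = _addcarry_u64(of, t, c[i], &t);     /* chain 2: +c   */
        dst[i] = t;
    }
}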
Overall, I think that time spent by Intel engineers on invention of ADX
could have been spent much better.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:the same problem.
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want
atomicity there.
To avoid race conditions with software clearing those bits, presumably.
ARM64 originally didn't support hardware updates in V8.0; they were
independent hardware features added in V8.1.
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
All of these are attempts to fix inherent drawbacks and limitations
in the SW-miss approach, and all of them run counter to the only
advantage SW-miss had: its simplicity.
Another advantage is the flexibility: you can implement any
translation scheme you want: hierarchical page tables, inverted page
tables, search trees, .... However, given that hierarchical page
tables have won, this is no longer an advantage anyone cares for.
The SW approach is inherently synchronous and serial -
it can only handle one TLB miss at a time, one PTE read at a time.
On an OoO engine, I don't see that. The table walker software is
called in its special context and the instructions in the table walker
are then run through the front end and the OoO engine. Another table
walk could be started at any time (even when the first table walk has
not yet finished feeding its instructions to the front end), and once
inside the OoO engine, the execution is OoO and concurrent anyway. It
would be useful to avoid two searches for the same page at the same
time, but hardware walkers have the same problem.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want atomicity there.
Each PTE read can cache miss and stall that walker.
As most OoO caches support multiple pending misses and hit-under-miss,
you can create as many HW walkers as you can afford.
Which poses the question: is it cheaper to implement n table walkers,
or to add some resources and mechanism that allows doing SW table
walks until the OoO engine runs out of resources, and a recovery
mechanism in that case.
I see other performance and conceptual disadvantages for the envisioned
SW walkers, however:
1) The SW walker is inserted at the front end and there may be many
ready instructions ahead of it before the instructions of the SW
walker get their turn. By contrast, a hardware walker sits in the
load/store unit and can do its own loads and stores with priority over
the program-level loads and stores. However, it's not clear that
giving priority to table walking is really a performance advantage.
2) Some decisions will have to be implemented as branches, resulting
in branch misses, which cost time and lead to all kinds of complexity
if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).
3) The reorder buffer processes instructions in architectural order.
If the table walker's instructions get their sequence numbers from
where they are inserted into the instruction stream, they will not
retire until after the memory access that waits for the table walker
is retired. Deadlock!
It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
it's probably easier to stay with hardware walkers.
- anton
On 8/17/2025 12:35 PM, EricP wrote:
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb drams were just making it to customers, 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982
https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions cover about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.
But a fixed 32-bit instruction is very much easier to fetch, and
decode needs a lot less logic for shifting prefetch buffers,
compared to, say, variable length 1 to 12 bytes.
When code density is the goal, a 16/32 RISC can do well.
Can note:
Maximizing code density often prefers fewer registers;
For 16-bit instructions, 8 or 16 registers is good;
8 is rather limiting;
32 registers uses too many bits.
Can note ISAs with 16 bit encodings:
PDP-11: 8 registers
M68K : 2x 8 (A and D)
MSP430: 16
Thumb : 8|16
RV-C : 8|32
SuperH: 16
XG1 : 16|32 (Mostly 16)
In my recent fiddling for trying to design a pair encoding for XG3, can
note the top-used instructions are mostly, it seems (non Ld/St):
ADD Rs, 0, Rd //MOV Rs, Rd
ADD X0, Imm, Rd //MOV Imm, Rd
ADDW Rs, 0, Rd //EXTS.L Rs, Rd
ADDW Rd, Imm, Rd //ADDW Imm, Rd
ADD Rd, Imm, Rd //ADD Imm, Rd
Followed by:
ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
ADDW Rd, Rs, Rd //ADDW Rs, Rd
ADD Rd, Rs, Rd //ADD Rs, Rd
ADDWU Rd, Rs, Rd //ADDWU Rs, Rd
Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.
For Load/Store:
SD Rn, Disp(SP)
LD Rn, Disp(SP)
LW Rn, Disp(SP)
SW Rn, Disp(SP)
LD Rn, Disp(Rm)
LW Rn, Disp(Rm)
SD Rn, Disp(Rm)
SW Rn, Disp(Rm)
For registers, there is a split:
Leaf functions:
R10..R17, R28..R31 dominate.
Non-Leaf functions:
R10, R18..R27, R8/R9
For 3-bit configurations:
R8..R15 Reg3A
R18/R19, R20/R21, R26/R27, R10/R11 Reg3B
Reg3B was a bit hacky, but had similar hit rates but uses less encoding space than using a 4-bit R8..R23 (saving 1 bit on the relevant scenarios).
BGB wrote:
On 8/17/2025 12:35 PM, EricP wrote:
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb drams were just making it to customers, 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982
https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions cover about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most
instructions.
But a fixed 32-bit instruction is very much easier to fetch, and
decode needs a lot less logic for shifting prefetch buffers,
compared to, say, variable length 1 to 12 bytes.
When code density is the goal, a 16/32 RISC can do well.
Can note:
Maximizing code density often prefers fewer registers;
For 16-bit instructions, 8 or 16 registers is good;
8 is rather limiting;
32 registers uses too many bits.
I'm assuming 16 32-bit registers, plus a separate RIP.
The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
With just 16 registers there would be no zero register.
The 4-bit register allows many 2-byte accumulate style instructions
(where a register is both source and dest)
8-bit opcode plus two 4-bit registers,
or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.
A flags register allows 2-byte short conditional branch instructions,
8-bit opcode and 8-bit offset. With no flags register the shortest conditional branch would be 3 bytes as it needs a register specifier.
If one is doing variable byte length instructions then
it allows the highest usage frequency to be most compact possible.
Eg. an ADD with 32-bit immediate in 6 bytes.
Can note ISAs with 16 bit encodings:
PDP-11: 8 registers
M68K : 2x 8 (A and D)
MSP430: 16
Thumb : 8|16
RV-C : 8|32
SuperH: 16
XG1 : 16|32 (Mostly 16)
The saving for fixed 32-bit instructions is that it only needs to
prefetch aligned 4 bytes ahead of the current instruction to maintain
1 decode per clock.
With variable length instructions from 1 to 12 bytes it could need
a 16 byte fetch buffer to maintain that decode rate.
And a 16 byte variable shifter (collapsing buffer) is much more logic.
I was thinking the variable instruction buffer shifter could be built
from tri-state buffers in a cross-bar rather than muxes.
The difference between supporting 16-bit-aligned and byte-aligned
variable-length instructions is that byte alignment doubles the number
of tri-state buffers.
In my recent fiddling for trying to design a pair encoding for XG3,
can note the top-used instructions are mostly, it seems (non Ld/St):
ADD Rs, 0, Rd //MOV Rs, Rd
ADD X0, Imm, Rd //MOV Imm, Rd
ADDW Rs, 0, Rd //EXTS.L Rs, Rd
ADDW Rd, Imm, Rd //ADDW Imm, Rd
ADD Rd, Imm, Rd //ADD Imm, Rd
Followed by:
ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
ADDW Rd, Rs, Rd //ADDW Rs, Rd
ADD Rd, Rs, Rd //ADD Rs, Rd
ADDWU Rd, Rs, Rd //ADDWU Rs, Rd
Most every other ALU instruction and usage pattern either follows a
bit further behind or could not be expressed in a 16-bit op.
For Load/Store:
SD Rn, Disp(SP)
LD Rn, Disp(SP)
LW Rn, Disp(SP)
SW Rn, Disp(SP)
LD Rn, Disp(Rm)
LW Rn, Disp(Rm)
SD Rn, Disp(Rm)
SW Rn, Disp(Rm)
For registers, there is a split:
Leaf functions:
R10..R17, R28..R31 dominate.
Non-Leaf functions:
R10, R18..R27, R8/R9
For 3-bit configurations:
R8..R15 Reg3A
R18/R19, R20/R21, R26/R27, R10/R11 Reg3B
Reg3B was a bit hacky, but had similar hit rates but uses less
encoding space than using a 4-bit R8..R23 (saving 1 bit on the
relevant scenarios).
EricP <ThatWouldBeTelling@thevillage.com> writes:
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want atomicity there.
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:
I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".
This sounds like an idea worth stealing, although no doubt the way I
would attempt to copy it would be a failure which removed all the
usefulness of it.
For one thing, I don't have a stack for calling subroutines, or any other purpose.
But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.
The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
that.)
So I've probably completely misunderstood you here.
John Savard
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than an order of magnitude more than what is needed
for a RISC chip.
Consider ARM2, which had 27000 transistors and which is sort of
the minimum RISC design you can manage (although it had a Booth
multiplier).
An ARMv2 implementation with added I and D cache, plus virtual
memory, would not have been the ideal design (too few registers, too
many bits wasted on conditional execution, ...) but it would have
run rings around the VAX.
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than an order of magnitude more than what is needed
for a RISC chip.
Consider ARM2, which had 27000 transistors and which is sort of
the minimum RISC design you can manage (although it had a Booth
multiplier).
An ARMv2 implementation with added I and D cache, plus virtual
memory, would not have been the ideal design (too few registers, too
many bits wasted on conditional execution, ...) but it would have
run rings around the VAX.
1 mln transistors is an upper estimate. But the low numbers given
for early RISC chips are IMO misleading: RISC became commercially
viable for high-end machines only in later generations, when
designers added a few "expensive" instructions.
Also, to fit the
design into a single chip, designers moved some functionality
like the bus interface to support chips. A RISC processor with
mixed 16-32 bit instructions (needed to get reasonable code
density), hardware multiply and FPU, including cache controller,
paging hardware and memory controller, is much more than the
100 thousand transistors cited for early workstation chips.
I have harped on you for a while to start development of your compiler.
One of the first things a compiler needs to do is to develop its means
to call subroutines and return back. This requires a philosophy of
passing arguments, returning results, dealing with recursion, and
dealing with TRY-THROW-CATCH SW-defined exception handling. I KNOW
of nobody who does this without some kind of stack.
According to Thomas Koenig <tkoenig@netcologne.de>:
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than an order of magnitude more than what is needed
for a RISC chip.
It also seems rather high for the /91. I can't find any authoritative
numbers but 100K seems more likely. It was SLT, individual transistors
mounted a few to a package. The /91 was big but it wasn't *that* big.
John Levine <johnl@taugh.com> wrote:
It's also seems rather high for the /91. I can't find any authoritative
numbers but 100K seems more likely. It was SLT, individual transistors
mounted a few to a package. The /91 was big but it wasn't *that* big.
I remember this number, but do not remember where I found it. So
it may be wrong.
However, one can estimate possible density in a different way: a package
probably of similar dimensions to a VAX package can hold about 100 TTL
chips. I do not have detailed data about chip usage and transistor
counts for each chip. A simple NAND gate is 4 transistors, but the input
transistor has two emitters and really works like two transistors,
so it is probably better to count it as 2 transistors, and consequently
consider a 2-input NAND gate as having 5 transistors. So a 74S00 gives
20 transistors. A D-flop is probably about 20-30 transistors, so a
74S74 is probably around 40-60. A quad D-flop brings us close to 100.
I suspect that in VAX times octal D-flops were available. There
were 4-bit ALU slices. Also, multiplexers need a nontrivial number
of transistors. So I think that 50 transistors is a reasonable (maybe
low) estimate of average density. Assuming 50 transistors per chip,
that would be 5000 transistors per package. Packages were rather
flat, so when mounted vertically one probably could allocate 1 cm
of horizontal space for each. That would allow 30 packages at a
single level. With 7 levels we get 210 packages, enough for
1 mln transistors.
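As a quick sanity check of that arithmetic (all four factors are the estimates above, not measured values), a trivial C snippet:

#include <stdio.h>

int main(void)
{
    /* All four numbers are the rough estimates from the post above. */
    long per_chip    = 50;   /* average transistors per TTL chip */
    long per_package = 100;  /* TTL chips per VAX-sized package  */
    long per_level   = 30;   /* packages per level, ~1 cm each   */
    long levels      = 7;

    long package = per_chip * per_package;
    long total   = package * per_level * levels;
    printf("%ld per package, %ld total\n", package, total);
    /* Prints: 5000 per package, 1050000 total */
    return 0;
}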
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:
0000000000010434 <arrays>:
10434: cd81 beqz a1,1044c <arrays+0x18>
10436: 058e slli a1,a1,0x3
10438: 87aa mv a5,a0
1043a: 00b506b3 add a3,a0,a1
1043e: 4501 li a0,0
10440: 6398 ld a4,0(a5)
10442: 07a1 addi a5,a5,8
10444: 953a add a0,a0,a4
10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
1044a: 8082 ret
1044c: 4501 li a0,0
1044e: 8082 ret
0000000000010450 <globals>:
10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
10470: 8082 ret
When using -Os, arrays becomes 2 bytes shorter, but the inner loop
becomes longer.
gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
compiles this to the following AMD64 code:
000000001139 <arrays>:
1139: 48 85 f6 test %rsi,%rsi
113c: 74 13 je 1151 <arrays+0x18>
113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
1142: 31 c0 xor %eax,%eax
1144: 48 03 07 add (%rdi),%rax
1147: 48 83 c7 08 add $0x8,%rdi
114b: 48 39 d7 cmp %rdx,%rdi
114e: 75 f4 jne 1144 <arrays+0xb>
1150: c3 ret
1151: 31 c0 xor %eax,%eax
1153: c3 ret
000000001154 <globals>:
1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
115b: 56 34 12
115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
116c: 12 ef cd
116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
117d: 90 78 56
1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
118e: 90 78 56
1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
1198: c3 ret
gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
compiles this to the following ARM A64 code:
0000000000000734 <arrays>:
734: b4000121 cbz x1, 758 <arrays+0x24>
738: aa0003e2 mov x2, x0
73c: d2800000 mov x0, #0x0 // #0
740: 8b010c43 add x3, x2, x1, lsl #3
744: f8408441 ldr x1, [x2], #8
748: 8b010000 add x0, x0, x1
74c: eb03005f cmp x2, x3
750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
754: d65f03c0 ret
758: d2800000 mov x0, #0x0 // #0
75c: d65f03c0 ret
0000000000000760 <globals>:
760: d299bde2 mov x2, #0xcdef // #52719
764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
768: f2b21562 movk x2, #0x90ab, lsl #16
76c: 9100e020 add x0, x1, #0x38
770: f2cacf02 movk x2, #0x5678, lsl #32
774: d2921563 mov x3, #0x90ab // #37035
778: f2e24682 movk x2, #0x1234, lsl #48
77c: f9001c22 str x2, [x1, #56]
780: d2824682 mov x2, #0x1234 // #4660
784: d299bde1 mov x1, #0xcdef // #52719
788: f2aacf03 movk x3, #0x5678, lsl #16
78c: f2b9bde2 movk x2, #0xcdef, lsl #16
790: f2a69561 movk x1, #0x34ab, lsl #16
794: f2c24683 movk x3, #0x1234, lsl #32
798: f2d21562 movk x2, #0x90ab, lsl #32
79c: f2d20241 movk x1, #0x9012, lsl #32
7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
7a4: f2eacf02 movk x2, #0x5678, lsl #48
7a8: f2eacf01 movk x1, #0x5678, lsl #48
7ac: a9008803 stp x3, x2, [x0, #8]
7b0: f9000c01 str x1, [x0, #24]
7b4: d65f03c0 ret
So, the overall sizes (including data size for globals() on RV64GC) are:
arrays globals Architecture
28 66 (34+32) RV64GC
27 69 AMD64
44 84 ARM A64
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test. Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions, so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
advantage of RV32GC over RV64GC there:
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
I guess it can be noted: is the overhead of any ELF metadata being excluded? ...
These are sizes of the .text section extracted with objdump -h. So
no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
in .sdata that other architectures have in .text; however, .sdata can
contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.
Granted, newer compilers do support newer versions of the C standard,
and also typically get better performance.
The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
from auto-vectorization).
There is one other improvement: gcc register allocation has improved
in recent years to a point where we 1) no longer need explicit
register allocation for Gforth on AMD64, and 2) with a lot of manual
help, we could increase the number of stack cache registers from 1 to
3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.
But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
I have not measured the scalar versions again, but given that there
were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
I doubt that I will see consistent speedups with newer gcc (or clang) versions.
- anton
On 7/28/2025 6:18 PM, John Savard wrote:
On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:
VAX tried too hard in my opinion to close the semantic gap.
Any operand could be accessed with any address mode. Now while this
makes the puny 16-register file seem larger,
what VAX designers forgot, is that each address mode was an instruction
in its own right.
So, VAX shot at minimum instruction count, and purposely miscounted
address modes not equal to %k as free.
Fancy addressing modes certainly aren't _free_. However, they are,
in my opinion, often cheaper than achieving the same thing with an
extra instruction.
So it makes sense to add an addressing mode _if_ what that addressing
mode does is pretty common.
The use of addressing modes drops off pretty sharply though.
Like, if one could stat it out, one might see a static-use pattern
something like:
80%: [Rb+disp]
15%: [Rb+Ri*Sc]
3%: (Rb)+ / -(Rb)
1%: [Rb+Ri*Sc+Disp]
<1%: Everything else
Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.
Granted, the dominance of [Rb+Disp] does drop off slightly when
considering dynamic instruction use. Part of it is due to the
prolog/epilog sequences.
If one had instead used (SP)+ and -(SP) addressing for prologs and
epilogs, then one might see around 20% or so going to these instead.
Or, if one had PUSH/POP, to PUSH/POP.
The discrepancy between static and dynamic instruction counts is then mostly due to things like loops and similar.
Estimating the effect of loops in a compiler is hard, but I had noted that
assuming a scale factor of around 1.5^D for loop nesting level D
seemed to be in the right area. Many loops end up never being reached, or
only running a few iterations, so, possibly counter-intuitively,
it is often faster to assume that a loop body will likely only cycle 2
or 3 times rather than 100s or 1000s; trying to aggressively
optimize loops by assuming large N tends to be detrimental to performance (a sketch of such a heuristic follows below).
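A sketch of such a weighting heuristic (the 1.5 factor is from above; the function name and shape are illustrative, not from any particular compiler):

#include <math.h>

/* Hypothetical static execution-frequency weight for a basic block
   nested D loops deep: each loop level is assumed to multiply the
   expected trip count by only ~1.5, not by hundreds or thousands. */
static double block_weight(int loop_depth)
{
    return pow(1.5, loop_depth);
}

/* A block at depth 3 then weighs 1.5^3 = 3.375 times a top-level
   block, so loop-carried values win register-allocation ties only
   mildly over straight-line ones. */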
Well, and at least thus far, profiler-driven optimization isn't really a thing in my case.
One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
ISA has a lot of registers, the relative benefit of LoadOp is reduced.
LoadOp being mostly a benefit if the value is loaded exactly once, and
there is some other ALU operation or similar that can be fused with it.
Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
z=arr[i]+x;
But, the relative incidence of things like this is low enough as to not
save that much.
The other thing is that one has to implement it in a way that does not increase pipeline length,
since if one makes the pipeline longer for the
sake of LoadOp or OpStore, then this is likely to be a net negative for performance vs prioritizing Load/Store, unless the pipeline had already needed to be lengthened for other reasons.
One can be like, "But what if the local variables are not in registers?"
but on a machine with 32 or 64 registers, most likely your local
variable is already going to be in a register.
So, the main potential merit of LoadOp being "doesn't hurt as bad on a register-starved machine".
That being said, though, designing a new machine today like the VAX
would be a huge mistake.
But the VAX, in its day, was very successful. And I don't think that
this was just a result of riding on the coattails of the huge popularity
of the PDP-11. It was a good match to the technology *of its time*,
that being machines that were implemented using microcode.
Yeah.
There are some living descendants of that family, but pretty much
everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.
John Savard
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and commercial RISCs only appeared in 1986.
That is certainly true, but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex instructions and address modes and the tiny 512-byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.
Another aspect from those measurements is that the 68k instruction set
(with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.
Another, which is not entirely their fault, is that they did not expect compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers, and instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC presumably didn't know about it.
DEC probably was aware from the work of William Wulf and his students
what optimizing compilers can do and how to write them. After all,
they used his language BLISS and its compiler themselves.
POLY would have made sense in a world where microcode makes sense: if microcode can be executed faster than subroutines, put a building
block for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
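To see why POLY belongs in a subroutine, note that the whole instruction is just Horner's rule, a few lines of ordinary code (a sketch; IIRC the VAX table holds the highest-order coefficient first):

/* Library-subroutine equivalent of VAX POLY: evaluate a polynomial
   of the given degree by Horner's rule.  table[] holds degree+1
   coefficients, highest-order term first (VAX layout, from memory). */
double poly(double x, int degree, const double *table)
{
    double r = table[0];
    for (int i = 1; i <= degree; i++)
        r = r * x + table[i];
    return r;
}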
Related to the microcode issue, they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot easier to pipeline.
My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
given us.
Another issue would be is how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one would still
prefer an ARM/SPARC/HPPA-like handling of conditions.
- anton
If the VAX designers could not afford a pipeline, then it is
not clear whether a RISC could afford it: removing the microcode
engine would reduce complexity and cost and give some
free space. But microcode engines tend to be simple.
Also, PDP-11 compatibility depended on microcode.
Without a microcode engine one would need a parallel set
of hardware instruction decoders, which could add
more complexity than was saved by removing the microcode
engine.
To summarize, it is not clear to me whether RISC in VAX technology
could be significantly faster than the VAX, especially given the
constraint of PDP-11 compatibility.
OTOH, the VAX designers probably felt
that the CISC nature added significant value: they understood
that the cost of programming was significant and believed that an
orthogonal instruction set, in particular allowing complex
addressing on all operands, made programming simpler.
They
probably thought that providing reasonably common procedures
as microcoded instructions made the work of programmers simpler,
even if such routines were only marginally faster than ordinary
code.
Part of this thinking was probably like the "Future
System" motivation at IBM: Digital did not want to produce
"commodity" systems, they wanted something with unique
features that customers would want to use.
Without
insight into the future it is hard to say that they were
wrong.
On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:
If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
cases, one gets most of the benefit with less issues.
That's certainly a way to do it. But then you either need to dedicate
one base register to each array - perhaps easier if there's opcode
space to use all 32 registers as base registers, which this would allow -
or you would have to load the base register with the address of the
array.
John Savard
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
10.5 on a characteristic mix, actually.
See "A Characterization of Processor Performance in the VAX-11/780"
by Emer and Clark, their Table 8.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:...
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
[...] POLY as an
instruction is bad.
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
Pipeline work from 1983 to the present has shown that LD and OP perform
just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
were LD and OP, and there are ways to perform LD and OP as if it were
LD+OP.
Condition codes get hard when DECODE width grows greater than 3.
antispam@fricas.org (Waldek Hebisch) posted:
-----------snip--------------
If VAX designers could not afford pipeline, than it is
not clear if RISC could afford it: removing microcode
engine would reduce complexity and cost and give some
free space. But microcode engines tend to be simple.
Witness the Mc 68000, Mc 68010, and Mc 68020. In all these
designs, the microcode and its surrounding engine took
1/2 of the die area inside the pins.
In 1980 it was possible to put the data path of a 32-bit
ISA on one die and pipeline it, but runs out of area when
you put microcode on the same die (area). Thus, RISC was
born. Mc88100 had a decoder and sequencer that was 1/8
of the interior area of the chip and had 4 FUs {Int,
Mem, MUL, and FADD} all pipelined.
It appears that Waldek Hebisch <antispam@fricas.org> said:
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
But how about this?
ADDL3 (R2)+,(R2)+,(R2)+
Now you need at least two temps, the second of which depends on the
first, and there are instructions with six operands. Or how about
this:
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
It must have seemed clever at the time, but ugh.
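A sketch of why operand decode is inherently sequential (the mode/register nibbles follow the VAX operand-specifier format, but this is heavily simplified and handles only the cases in the example above):

/* Find the total length of a VAX instruction's operand specifiers.
   Simplified: only short literals, register, autoincrement, and
   immediate (which is just (PC)+) are handled here. */
unsigned operand_bytes(const unsigned char *p, int noperands,
                       unsigned opsize /* operand size in bytes,
                                          implied by the opcode */)
{
    unsigned off = 0;
    for (int i = 0; i < noperands; i++) {
        unsigned char spec = p[off++];
        unsigned mode = spec >> 4, reg = spec & 0x0f;
        if (mode <= 3)
            continue;          /* 6-bit short literal, no extra bytes */
        if (mode == 8 && reg == 15)
            off += opsize;     /* (PC)+ = immediate: skip its bytes   */
        /* mode 5 = Rn and mode 8 = (Rn)+ take no extra bytes; the real
           VAX has a dozen more modes, several with displacements. */
        /* Key point: operand i+1 cannot even be located until operand
           i's specifier (and any immediate) has been examined. */
    }
    return off;
}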
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
10.5 on a characteristic mix, actually.
See "A Characterization of Processor Performance in the VAX-11/780"
by Emer and Clark, their Table 8.
Going through the VAX 780 hardware schematics and various performance
papers, near as I can tell it took *at least* 1 clock per instruction byte for decode, plus any I&D cache miss and execute time, as it appears to
use microcode to pull bytes from the 8-byte instruction buffer (IB)
*one at a time*.
So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.
And I say "at least" 1 C/IB as I am not including any micro-pipeline
stalls.
The microsequencer has some pipelining, overlap read of the next uWord
with execute of current, which would introduce a branch delay slot into
the microcode. As it uses the opcode and operand bytes to do N-way
jump/call
to uSubroutines, each of those dispatches might have a branch delay slot too.
(Similar issues appear in the MV-8000 uSequencer except it appears to
have 2 or maybe 3 microcode branch delay slots).
John Levine <johnl@taugh.com> posted:
It appears that Waldek Hebisch <antispam@fricas.org> said:
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
But how about this?
ADDL3 (R2)+,(R2)+,(R2)+
Now you need at least two temps, the second of which depends on the
first, and there are instructions with six operands. Or how about
this:
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
It must have seemed clever at the time, but ugh.
What we must all realize is that each address mode in VAX was a microinstruction all unto itself.
And that is why it was not pipelineable in any real sense.
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
In most cases you have either one of the other but not both. E.g. in
C we don't have nested functions, and in Javascript functions are heap-allocated objects.
Other than GNU C (with its support for nested functions), which other language has this weird combination of features?
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
In most cases you have either one of the other but not both. E.g. in
C we don't have nested functions, and in Javascript functions are
heap-allocated objects.
Other than GNU C (with its support for nested functions), which other
language has this weird combination of features?
Well, more precisely:
- function pointer is supposed to take the same space as a single
machine address
- function pointer is supposed to be directly invokable, that is
point to machine code of the function
- one wants to support nested functions
- there is no garbage collector, one does not want to introduce extra
stack and one does not want to leak storage allocated to nested
functions.
To explain more:
- arguably, in "safe" C data pointers should consist
of 3 machine words; such pointers have room for the extra data needed
for nested functions.
- some calling conventions introduce extra indirection, that is, the
function pointer points to a data structure containing the address
of the machine code and the extra data needed by nested functions.
A function call puts the extra data in a dedicated machine register and
then transfers control via the address contained in the function data
structure. IIUC IBM AIX uses such an approach.
- one could create trampolines in a separate area of memory. In
that case there is trouble with deallocating no-longer-needed
trampolines. This trouble can be resolved by using GC, or
by using a parallel stack dedicated to trampolines.
Concerning languages, any language which has nested functions and
wants seamless cooperation with C needs to resolve the problem.
That affects Pascal, Ada, PL/I. That is basically most classic
non-C languages. IIUC several "higher level" languages resolve
the trouble by a combination of a parallel stack and/or GC. But
when a language wants to compete with the efficiency of C and does not
want GC, then trampolines allocated on the machine stack may be the
only choice (on a register-starved machine a parallel stack may be
too expensive). AFAIK GNU Ada uses (or used) trampolines
allocated on the machine stack.
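For concreteness, the GNU C case that forces a trampoline looks like this (a standard gcc extension; taking the nested function's address is what triggers generating the trampoline on the stack):

#include <stdio.h>

static void apply(void (*f)(int), int n)
{
    for (int i = 0; i < n; i++)
        f(i);
}

void demo(void)
{
    int sum = 0;
    void add(int i) { sum += i; }  /* nested; captures 'sum'             */
    apply(add, 4);                 /* 'add' decays to a stack trampoline */
    printf("%d\n", sum);           /* prints 6                           */
}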
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so
there's no dynamic allocation involved.
Stefan
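That two-word alternative is easy to express in C (a sketch; the struct and names are mine, not any existing ABI):

/* A "fat" function pointer: code address plus environment pointer.
   Two words, no dynamic allocation, no executable stack needed. */
struct closure {
    long (*code)(void *env, long arg);  /* callees take env explicitly */
    void *env;                          /* enclosing frame, or NULL    */
};

static inline long call_closure(struct closure c, long arg)
{
    return c.code(c.env, arg);
}

The cost is that plain C function pointers and these closures are no longer interchangeable, which is exactly the seamless-cooperation-with-C problem described above.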
AFAIK this is a problem only in those rare languages where a function...
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
Other than GNU C (with its support for nested functions), which other language has this weird combination of features?
On 8/30/2025 1:22 PM, Stefan Monnier wrote:
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so there's no dynamic allocation involved.
FDPIC typically always uses the normal pointer width, just with more indirection:
Load target function pointer from GOT;
Save off current GOT pointer to stack;
Load code pointer from function pointer;
Load GOT pointer from function pointer;
Call function;
Reload previous GOT pointer.
It, errm, kinda sucks...
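In rough C, the descriptor dance described above looks like this (names are illustrative; a real FDPIC toolchain does this in the ABI call sequence, with the GOT pointer in a reserved register, not in user code):

/* FDPIC-style call: the "function pointer" is one machine word, but
   it addresses a two-word descriptor rather than code directly. */
struct funcdesc {
    void *code;   /* entry point       */
    void *got;    /* callee's GOT base */
};

extern void *current_got;   /* stands in for the reserved GOT register */

static long call_fdpic(const struct funcdesc *fd, long arg)
{
    void *saved = current_got;           /* save caller's GOT pointer */
    current_got = fd->got;               /* install callee's GOT      */
    long r = ((long (*)(long))fd->code)(arg);
    current_got = saved;                 /* restore after return      */
    return r;
}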
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                        Instructions
arrays  globals     Architecture    arrays  globals
28      66 (34+32)  RV64GC              12       9
27      69          AMD64               11       9
44      84          ARM A64             11      22
32      68          My 66000             8       5
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test.
Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions,
so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
* My 66000 uses ST immediate for globals.
- anton
Apr 2003: Opteron launch
Sep 2003: Athlon 64 launch
Oct 2003 (IIRC): I buy an Athlon 64
Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC
I installed Fedora Core 1 on my Athlon64 box in early 2004.
Why wait for MS?
... I don't think GNU/Linux enthusiasts were the main buyers of
those Opteron and Athlon64 machines.
Apr 2003: Opteron launch
Sep 2003: Athlon 64 launch
Oct 2003 (IIRC): I buy an Athlon 64
Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC
I installed Fedora Core 1 on my Athlon64 box in early 2004.
Why wait for MS?
Same here (tho I was on team Debian)
I would have liked to install 64-bit Debian (IIRC I initially ran
32-bit Debian on the Athlon 64), but they were not ready at the time
... so eventually I decided to go with Fedora Core 1, which just
implemented /lib and /lib64 and was there first.
For some reason I switched to Gentoo relatively soon after ...
before finally settling in Debian-land several years later.
I would have liked to install 64-bit Debian (IIRC I initially ran
32-bit Debian on the Athlon 64), but they were not ready at the time,
and still busily working on their multi-arch (IIRC) plans, so
eventually I decided to go with Fedora Core 1, which just implemented
/lib and /lib64 and was there first.
For some reason I switched to Gentoo relatively soon after
(/etc/hostname from 2005-02-20, and IIRC Debian still had not finished hammering out multi-arch at that time), before finally settling in Debian-land several years later.
Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
first Debian with official AMD64 support.
Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
first Debian with official AMD64 support.
Indeed, I misremembered: I used Debian's i386 port on my 2003 AMD64
machine.
It didn't have enough RAM to justify the bother of distro hopping. 🙂
Stefan
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                        Instructions
arrays  globals     Architecture    arrays  globals
28      66 (34+32)  RV64GC              12       9
27      69          AMD64               11       9
44      84          ARM A64             11      22
32      68          My 66000             8       5
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test.
Size is one thing, but sooner or later one has to execute the instructions,
and here My 66000 needs to execute fewer, while being within spitting
distance on code size.
Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions,
3 for My 66000
so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
* My 66000 uses ST immediate for globals.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables; >>>> Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                        Instructions
arrays  globals     Architecture    arrays  globals
28      66 (34+32)  RV64GC              12       9
27      69          AMD64               11       9
44      84          ARM A64             11      22
32      68          My 66000             8       5
In light of the above, what do people think is more important, small
code size or fewer instructions ??
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
On 9/4/2025 8:23 AM, MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables; >>>> Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                        Instructions
arrays  globals     Architecture    arrays  globals
28      66 (34+32)  RV64GC              12       9
27      69          AMD64               11       9
44      84          ARM A64             11      22
32      68          My 66000             8       5
In light of the above, what do people think is more important, small
code size or fewer instructions ??
In general yes, but as you pointed out in another post, if you are
talking about a GBOoO machine, it isn't the absolute number of
instructions (because of parallel execution), but the number of cycles
to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller
AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
And, of course your "150%" is arbitrary,
but I agree that small
differences in code size are not important, except in some small
embedded applications.
And I guess I would add, as a third, much lower priority, power usage.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                        Instructions
arrays  globals     Architecture    arrays  globals
28      66 (34+32)  RV64GC              12       9
27      69          AMD64               11       9
44      84          ARM A64             11      22
32      68          My 66000             8       5
In light of the above, what do people think is more important, small
code size or fewer instructions ??
Performance from a given chip area.
The RISC-V people argue that they can combine instructions with a few transistors. But, OTOH, they have 16-bit and 32-bit wide
instructions, which means that a part of the decoder results will be
thrown away, increasing the decode cost for a given number of average
decoded instructions per cycle. Plus, they need more decoded
instructions per cycle for a given amount of performance.
Intel and AMD demonstrate that you can get high performance even with
an instruction set that is even worse for decoding, but that's not cheap.
ARM A64 goes the other way: Fixed-width instructions ensure that all
decoding on correctly predicted paths is actually useful.
However, it pays for this in other ways: Instructions like load pair
with auto-increment need to write 3 registers, and the write port
arbitration certainly has a hardware cost. However, such an
instruction would need two loads and an add if expressed in RISC-V; if
RISC-V combines these instructions, it has the same write-port
arbitration problem. If it does not combine at least the loads, it
will tend to perform worse with the same number of load/store units.
So it's a balancing game: If you lose some weight here, do you need to
add the same, more, or less weight elsewhere to compensate for the
effects elsewhere?
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).
I don't think that even VAX encoding would be the major problem of the
VAX these days. There are microop caches and speculative decoders for
that (although, as EricP points out, the VAX is an especially
expensive nut to crack for a speculative decoder).
In any case, if smaller code size was it, RV64GC would win according
to my results. However, compilers often generate code that has a
bigger code size rather than a smaller one (loop unrolling, inlining),
so code size is not that important in the eyes of the maintainers of
these compilers.
I also often see code produced with more (dynamic) instructions than necessary. So the number of instructions is apparently not that
important, either.
- anton
On 9/5/2025 10:03 AM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
For example:
* 00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
* 01in-nnnn-iiii-0000 LI Imm5s, Rn5
* 10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
* 11mn-nnnn-mmmm-0000 MV Rm5, Rn5
* 0000-nnnn-iiii-0100 ADDW Imm4u, Rn4
* 0001-nnnn-mmmm-0100 SUB Rm4, Rn4
* 0010-nnnn-mmmm-0100 ADDW Imm4n, Rn4
* 0011-nnnn-mmmm-0100 MVW Rm4, Rn4 //ADDW Rm, 0, Rn
* 0100-nnnn-mmmm-0100 ADDW Rm4, Rn4
* 0101-nnnn-mmmm-0100 AND Rm4, Rn4
* 0110-nnnn-mmmm-0100 OR Rm4, Rn4
* 0111-nnnn-mmmm-0100 XOR Rm4, Rn4
* 0iii-0nnn-0mmm-1001 ? SLL Rm3, Imm3u, Rn3
* 0iii-0nnn-1mmm-1001 ? SRL Rm3, Imm3u, Rn3
* 0iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3u, Rn3
* 0iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3u, Rn3
* 1iii-0nnn-0mmm-1001 ? AND Rm3, Imm3u, Rn3
* 1iii-0nnn-1mmm-1001 ? SRA Rm3, Imm3u, Rn3
* 1iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3n, Rn3
* 1iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3n, Rn3
* 0ooo-0nnn-0mmm-1101 ? SLL Rm3, Ro3, Rn3
* 0ooo-0nnn-1mmm-1101 ? SRL Rm3, Ro3, Rn3
* 0ooo-1nnn-0mmm-1101 ? AND Rm3, Ro3, Rn3
* 0ooo-1nnn-1mmm-1101 ? SRA Rm3, Ro3, Rn3
* 1ooo-0nnn-0mmm-1101 ? ADD Rm3, Ro3, Rn3
* 1ooo-0nnn-1mmm-1101 ? SUB Rm3, Ro3, Rn3
* 1ooo-1nnn-0mmm-1101 ? ADDW Rm3, Ro3, Rn3
* 1ooo-1nnn-1mmm-1101 ? SUBW Rm3, Ro3, Rn3
* 0ddd-nnnn-mmmm-0001 LW Disp3u(Rm4), Rn4
* 1ddd-nnnn-mmmm-0001 LD Disp3u(Rm4), Rn4
* 0ddd-nnnn-mmmm-0101 SW Rn4, Disp3u(Rm4)
* 1ddd-nnnn-mmmm-0101 SD Rn4, Disp3u(Rm4)
* 00dn-nnnn-dddd-1001 LW Disp5u(SP), Rn5
* 01dn-nnnn-dddd-1001 LD Disp5u(SP), Rn5
* 10dn-nnnn-dddd-1001 SW Rn5, Disp5u(SP)
* 11dn-nnnn-dddd-1001 SD Rn5, Disp5u(SP)
* 00dd-dddd-dddd-1101 J Disp10
* 01dn-nnnn-dddd-1101 LD Disp5u(SP), FRn5
* 10in-nnnn-iiii-1101 LUI Imm5s, Rn5
* 11dn-nnnn-dddd-1101 SD FRn5, Disp5u(SP)
Could achieve a higher average hit-rate than RV-C while *also* using
less encoding space.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables; >>>> Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                        Instructions
arrays  globals     Architecture    arrays  globals
28      66 (34+32)  RV64GC              12       9
27      69          AMD64               11       9
44      84          ARM A64             11      22
32      68          My 66000             8       5
In light of the above, what do people think is more important, small
code size or fewer instructions ??
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test.
Size is one thing, but sooner or later one has to execute the instructions,
and here My 66000 needs to execute fewer, while being within spitting
distance on code size.
Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions,
3 for My 66000
so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four
instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
* My 66000 uses ST immediate for globals.
- anton
Things could be architected to allow a tradeoff between code size and the number of instructions executed within the same ISA. Sometimes one may want really small code; other times performance is more important.
Waldek Hebisch <antispam@fricas.org> schrieb:
- one could create trampolines in a separate area of memory. In
that case there is trouble with deallocating no-longer-needed
trampolines. This trouble can be resolved by using GC, or
by using a parallel stack dedicated to trampolines.
[...]
gcc has -ftrampoline-impl=[stack|heap], see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html
Don't longjmp out of a nested function though.
Thomas Koenig <tkoenig@netcologne.de> posted:
Waldek Hebisch <antispam@fricas.org> schrieb:
- one could create trampolines in a separate area of memory. In
such case there is trouble with dealocating no longer needed
trampolines. This trouble can be resolved by using GC. Or
by using a parallel stack dedicated to trampolines.
[...]
gcc has -ftrampoline-impl=[stack|heap], see
https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html
Don't longjmp out of a nested function though.
Or longjmp around subroutines using 'new'.
Or longjmp out of 'signal' handlers.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Or longjump out of 'signal' handlers.
Again, one must take the appropriate care. Such as
using the correct API (e.g. POSIX siglongjmp(2)).
It is quite common to use siglongjmp to leave
a SIGINT (Control-C) handler.
scott@slp53.sl.home (Scott Lurndal) writes:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Or longjump out of 'signal' handlers.
Again, one must take the appropriate care. Such as
using the correct API (e.g. POSIX siglongjmp(2)).
The restrictions on siglongjmp() when jumping out of signal handlers
are the same as for longjmp(). See the section "Application Usage" in https://pubs.opengroup.org/onlinepubs/9799919799/functions/longjmp.html
The only difference is that sigsetjmp() saves the signal mask and siglongjmp() restores it.
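A minimal version of that common pattern (standard POSIX calls; error checking omitted):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_sigint(int sig)
{
    (void)sig;
    siglongjmp(env, 1);   /* leave the handler; saved mask is restored */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    if (sigsetjmp(env, 1)) {    /* the 1 asks sigsetjmp to save the mask */
        printf("interrupted, back at top level\n");
        return 0;
    }
    for (;;)
        pause();                /* Ctrl-C lands back at sigsetjmp above */
}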
Same here (tho I was on team Debian), but I don't think GNU/Linux
enthusiasts were the main buyers of those Opteron and
Athlon64 machines.
On 31.08.2025 16:43 Uhr Stefan Monnier wrote:
Same here (tho I was on team Debian), but I don't think GNU/Linux
enthusiasts were the main buyers of those Opteron and
Athlon64 machines.
Athlon 64 machines were mostly shipped with Windows XP 32 bit - even
when XP 64 bit existed for that architecture.
On 2025-09-21, Marco Moock wrote:
On 31.08.2025 16:43 Uhr Stefan Monnier wrote:
Same here (tho I was on team Debian), but I don't think GNU/Linux
enthusiasts were the main buyers of those Opteron and
Athlon64 machines.
Athlon 64 machines were mostly shipped with Windows XP 32 bit - even
when XP 64 bit existed for that architecture.
Which one was NT 5.2 and not 5.1? XP for IA-64 or XP for amd64?