• Re: VAX and other pages

    From John Levine@johnl@taugh.com to comp.arch on Fri Aug 15 20:40:44 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    ARM64 (ARMv8) architecturally supports 4k, 16k and 64k.

S/370 had 2K or 4K pages grouped into 64K or 1M segments. By the time it became S/390 it was just 4K pages and 1M segments, in a 31-bit address space.

    In zSeries there are multiple 2G regions consisting of 1M segments and 4K pages.
    A segment can optionally be mapped as a single unit, in effect a 1M page.

These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks. I can believe that with today's giant memories and bloated programs, larger-than-4K pages would work better.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Aug 15 21:22:53 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.

    Two block devices bought less than a year ago:

    Disk model: KINGSTON SEDC2000BM8960G
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

    Disk model: WD Blue SN580 2TB
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 16 01:22:57 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.

    Two block devices bought less than a year ago:

    SSDs often let you do 512 byte reads and writes for backward compatibility even though the physical block size is much larger.

    Wikipedia tells us all about it:

    https://en.wikipedia.org/wiki/Advanced_Format#512_emulation_(512e)

    Disk model: KINGSTON SEDC2000BM8960G

    Says here the block size of the 480GB version is 16K, so I'd assume the 960GB is
    the same:

    https://www.techpowerup.com/ssd-specs/kingston-dc2000b-480-gb.d2166

    Disk model: WD Blue SN580 2TB

    I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Aug 16 05:09:43 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors. In 1985, 1986 and 1992 the common HDDs of the time had
    actual 512B sectors, so if that argument had any merit, the i386
    (1985), MIPS R2000 (1986), SPARC (1986), and Alpha (1992) should have
    been introduced with 512B pages, but they actually were introduced
    with 4KB (386, MIPS, SPARC) and 8KB (Alpha) pages.

    Disk model: WD Blue SN580 2TB

I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.

    https://www.techpowerup.com/ssd-specs/western-digital-sn580-2-tb.d1542

    claims

    |Page Size: 16 KB
    |Block Size: 1344 Pages

    I assume that the "Block size" means the size of an erase block.
    Where does the number 1344 come from? My guess is that it has to do
    with:

    |Type: TLC
    |Technology: 112-layer

    3*112*4=1344

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 03:17:49 2025
    From Newsgroup: comp.arch

    On 8/15/2025 2:03 PM, Stephen Fuld wrote:
    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to have "typically 97% hit rate".  I would go for larger pages, which would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found.  I suspect
    the average page size should grow as memory gets cheaper, which leads to more memory on average in systems.  This also leads to larger programs,
    as they can "fit" in larger memory with less paging.  And as disk
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down.  While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K.  At some point
    in the future, it may get to 64K, but not for some years yet.


    Some of the programs I have tested don't have particularly large memory footprints by modern standards (~ 10 to 50MB).

    Excluding very small programs (where TLB miss rate becomes negligible)
    had noted that 16K appeared to be reasonably stable.


    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did
    see a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much
    increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    Had noted if running stats on where in the pages memory accesses land:
       4K: Pages tend to be accessed fairly evenly
      16K: Minor variation as to what parts of the page are being used.
      64K: Significant variation between parts of the page.
    Basically, tracking per-page memory accesses on a finer grain boundary
    (eg, 512 bytes).

    Interesting.


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at
    all (and increasing page size only really sees benefit for TLB miss
    rate so long as the whole page is "actually being used").

    Not necessarily.  Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart.  That takes 2 TLB slots with 4K pages, but only one with larger pages.


    This is part of why 16K has an advantage.

    But, it drops off with 32K or 64K, as one may have a lot of large gaps
    of relatively little activity.

    So, rather than having a 64K page with two or more hot-spots ~ 30K apart
    or less, one may often just have a lot of pages with one hot-spot.

    Granted, my testing was far from exhaustive...


One may think that larger pages would always be better for TLB miss rate,
    but this assumes that most of the pages have most of the page being
    accessed.

    Which, as noted, is fairly true at 4/8/16K, but seemingly not as true at
    32K or 64K.

    And, for the more limited effect of the larger page size on reducing TLB
    miss rate, one does have a lot more memory being wasted by things like "mmap()" type calls.


Say, for example, you want to allocate 93K via "mmap()":
   4K pages:  96K (waste= 3K,  3%)
   8K pages:  96K
  16K pages:  96K
  32K pages:  96K
  64K pages: 128K (waste=35K, 27%)
OK, 99K:
   4K: 100K (waste= 1K,  1%)
   8K: 104K (waste= 5K,  5%)
  16K: 112K (waste=13K, 12%)
  32K: 128K (waste=29K, 23%)
  64K: 128K
What about 65K:
   4K:  68K (waste= 3K,  4%)
   8K:  72K (waste= 7K, 10%)
  16K:  80K (waste=15K, 19%)
  32K:  96K (waste=31K, 32%)
  64K: 128K (waste=63K, 49%)

    ...


    So, bigger pages aren't great for "mmap()" with smaller allocation sizes.
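As a quick cross-check of the numbers above, here is a small throwaway program (not from the original post) that rounds each request up to the page size and reports the waste:

#include <stdio.h>

int main(void)
{
    long req[]  = { 93, 99, 65 };            /* request sizes, in KB */
    long page[] = { 4, 8, 16, 32, 64 };      /* page sizes, in KB    */

    for (int i = 0; i < 3; i++) {
        printf("request %ldK:\n", req[i]);
        for (int j = 0; j < 5; j++) {
            long alloc = (req[i] + page[j] - 1) / page[j] * page[j];
            long waste = alloc - req[i];
            printf("  %2ldK pages: %4ldK  (waste %2ldK, %2.0f%%)\n",
                   page[j], alloc, waste, 100.0 * waste / alloc);
        }
    }
    return 0;
}

It reproduces the figures above (for example, 99K with 16K pages rounds up to 112K, about 12% waste).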


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Aug 16 10:00:18 2025
    From Newsgroup: comp.arch

    On 8/15/2025 10:09 PM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I don't think anyone has argued for 512B page sizes. There are two
    issues that are perhaps being conflated. One is whether it would be
    better if page sizes were increased from the current typical 4K to 16K.
    The other is about changing the size of blocks on disks (both hard disks
    and SSDs) from 512 bytes to 4K bytes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 16 17:06:42 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I think we're agreeing that even in the early 1980s a 512 byte page was
    too small. They certainly couldn't have made it any smaller, but they
    should have made it larger.

S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 15:26:31 2025
    From Newsgroup: comp.arch

    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
    running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table, but is otherwise pretty much
    OK assuming TLB miss rate isn't too unreasonable.


For the TLB, had noticed best results with 4 or 8 way associativity:
  1-way: Doesn't work for main TLB.
    1-way works OK for an L1-TLB in a split L1/L2 TLB config.
  2-way: Barely works
    In some edge cases and configurations,
    may get stuck in a TLB miss loop.
  4-way: Works fairly well, cheaper option.
  8-way: Works better, but significantly more expensive.

    A possible intermediate option could be 6-way associativity.
    Full associativity is impractically expensive.
    Also a large set associative TLB beats a small full associative TLB.

    For a lot of the test programs I run, TLB size:
    64x: Small, fairly high TLB miss rate.
    256x: Mostly good
    512x or 1024x: Can mostly eliminate TLB misses, but debatable.

    In practice, this has mostly left 256x4 as the main configuration for
    the Main TLB. Optionally, can use a 64x1 L1 TLB (with the main TLB as an
    L2 TLB), but this is optional.
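For what it's worth, the lookup side of such a set-associative TLB is simple; a rough C model (made-up field names, assuming 16K pages and the 256x4 configuration, not the actual hardware) looks roughly like:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 14            /* 16K pages                 */
#define TLB_SETS   256
#define TLB_WAYS   4

typedef struct {
    uint64_t vpn;                /* virtual page number (tag) */
    uint16_t asid;
    uint64_t ppn;                /* physical page number      */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Hit: fill in *ppn and return true.  A miss would trap to the
   software miss handler in the real design. */
static bool tlb_lookup(uint64_t va, uint16_t asid, uint64_t *ppn)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    unsigned set = (unsigned)(vpn & (TLB_SETS - 1));
    for (int way = 0; way < TLB_WAYS; way++) {   /* compared in parallel in HW */
        tlb_entry_t *e = &tlb[set][way];
        if (e->valid && e->vpn == vpn && e->asid == asid) {
            *ppn = e->ppn;
            return true;
        }
    }
    return false;
}

More ways means more tag comparators and muxing per lookup, which is where the 8-way cost comes from.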


    A hardware page walker or inverted page table has been considered, but
    not crossed into use yet. If I were to add a hardware page walker, it
    would likely be semi-optional (still allowing processes to use
    unconventional memory management as needed, *).

    Supported page sizes thus far are 4K, 16K, and 64K. In test-kern, 16K
    mostly won out, using a 3-level page table and 48-bit address space,
    though technically the current page-table layout only does 47 bits.

    Idea was that the high half of the address space could use a separate
    System page table, but this isn't really used thus far.

*: One merit of software TLB is that it allows for things like nested
    page tables or other trickery without needing any actual hardware
    support. Though, you can also easily enough fake software TLB in
    software as well (a host TLB miss pulling from the guest TLB and
    translating the address again).


    Ended up not as inclined towards inverted page tables, as they offer
    fewer benefits than a page walker but would have many of the same issues
    in terms of implementation complexity (needs to access RAM and perform multiple memory accesses to resolve a miss, ...). The page walker then
    is closer to the end goal, whereas the IPT is basically just a much
    bigger RAM-backed TLB.



    Actually, it is not too far removed from doing a weaker (not-quite-IEEE)
    FPU in hardware, and then using optional traps to emulate full IEEE
    behavior (nevermind if such an FPU encountering things like subnormal
    numbers or similar causes performance to tank; and the usual temptation
    to just disable the use of full IEEE semantics).

    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 06:16:08 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.

    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 10:00:56 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 15:21:38 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.
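For reference, that kind of behavior is easy to reproduce. One plausible loop ordering (not necessarily the exact test) is the j-k-i nest below: with row-major 1000x1000 double matrices the inner loop steps down a column, so a[i][k] and c[i][j] each move 8000 bytes per iteration, land on a different 4K page nearly every time, and touch far more pages per pass than the TLB holds, i.e. roughly 2 TLB misses per iteration.

#define N 1000
static double a[N][N], b[N][N], c[N][N];

void matmul_pessimal(void)            /* C += A*B, pessimal spatial locality */
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)       /* column-wise inner loop */
                c[i][j] += a[i][k] * b[k][j];
}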

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 11:29:20 2025
    From Newsgroup: comp.arch

    On 8/17/2025 1:16 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.


    Yes.

    AVL tree is a balanced binary tree that tracks depth and "rotates" nodes
    as needed to keep the depth of one side within +/- 1 of the other.

    The B-Trees would use N elements per node, which are stored in sorted
    order so that one can use a binary search.


    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?


    Can use less RAM for large sparse address spaces with aggressive ASLR.
However, looking up a page or updating the page table are significantly
    slower (enough to be relevant).

    Though, I mostly ended up staying with more conventional page tables and weakening the ASLR, where it may try to reuse the previous bits (47:25)
    and (47:36) of the address a few times, to reduce page-table
    fragmentation (sparse, mostly-empty, page table pages).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 13:35:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
It's programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an eight-chip
inverter (a bit slower, but inverters should be fast).

Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
a wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

inst buf <15:8>   <7:0>
         |    |   |   |
       4:16 4:16 4:16 4:16
       vvvv vvvv vvvv vvvv
  10k  ---|---|---|---|------>INV->
  10k  ---------------------->INV->
  10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

Agreed, the logic has to go somewhere.  Regularity in the
instruction set would have been even more important than now
to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

But a fixed 32-bit instruction is very much easier to fetch and
decode, and needs a lot less logic for shifting prefetch buffers,
compared to, say, variable length 1 to 12 bytes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 12:53:32 2025
    From Newsgroup: comp.arch

    On 8/17/2025 9:00 AM, EricP wrote:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software.  While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.


    I am not saying SW page walkers are fast.
    Though in my experience, the cycle cost of the SW TLB miss handling
    isn't "too bad".

    If it were a bigger issue in my case, could probably add a HW page
    walker, as I had long considered it as a possible optional feature. In
    this case, it could be per-process (with the LOBs of the page-base
    register also encoding whether or not HW page-walking is allowed; along
    with in my case also often encoding the page-table type/layout).


    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."


    This is around 2 orders of magnitude more than I am often seeing in my
    testing (mind you, with a TLB miss handler that is currently written in C).


    But, this is partly where things like page-sizes and also the size of
    the TLB can have a big effect.

    Ideally, one wants a TLB that has a coverage larger than the working set
    of the typical applications (and OS); at which point miss rate becomes negligible. Granted, if one has GB's of RAM, and larger programs, this
    is a harder problem...


    Then the ratio of working set to TLB coverage comes into play, which
    granted (sadly) appears to follow an (workingSet/coverage)^2 curve...


    I had noted before that some of the 90s era RISC's had comparably very
    small TLBs, such as 64-entry fully associative, or 16x4.
    Such a TLB with a 4K page size having a coverage of roughly 256K.

    Where, most programs have working sets somewhat larger than 256K.

    Looking it up, the DEC Alpha used a 48 entry TLB, so 192K coverage, yeah...


    The CPU time cost of TLB Miss handling would be significantly reduced
    with a "not pissant" TLB.



    I was mostly using 256x4, with a 16K page size, which covers a working
    set of roughly 16MB.

    A 1024x4 would cover 64MB, and 1024x6 would cover 96MB.

    One possibility though would be to use 64K pages for larger programs,
    which would increase coverage of a 1024x TLB to 256MB or 384MB.

    At present, a 1024x4 TLB would use 64K of Block-RAM, and 1024x6 would
    use 98K.

    But, yeah... this is comparable to the apparent TLB sizes on a lot of
    modern ARM processors; which typically deal with somewhat larger working
    sets than I am dealing with.
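Spelling that arithmetic out (the ~16 bytes per TLB entry is an assumption here, just to match the Block-RAM figures above):

#include <stdio.h>

int main(void)
{
    struct { int sets, ways, page_kb; } cfg[] = {
        { 256, 4, 16 }, { 1024, 4, 16 }, { 1024, 6, 16 },
        { 1024, 4, 64 }, { 1024, 6, 64 },
    };
    for (int i = 0; i < 5; i++) {
        long entries  = (long)cfg[i].sets * cfg[i].ways;
        long cover_mb = entries * cfg[i].page_kb / 1024;
        long ram_kb   = entries * 16 / 1024;     /* ~16B per entry, assumed */
        printf("%4dx%d, %2dK pages: coverage %3ld MB, ~%3ld KB of RAM\n",
               cfg[i].sets, cfg[i].ways, cfg[i].page_kb, cover_mb, ram_kb);
    }
    return 0;
}

Which gives 16/64/96 MB coverage for the 16K-page configurations and 256/384 MB with 64K pages, with on the order of 64K to 96K of entry storage for the 1024-set cases.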


    Another option is to RAM-back part of the TLB, essentially as an
    "Inverted Page Table", but admittedly, this has similar complexities to
    a HW page walker (and the hassle of still needing a fault handler to
    deal with missing IPT entries).



    In an ideal case, could make sense to write at least the fast path of
    the miss handler in ASM.

    Note that TLB misses are segregated into their own interrupt category
separate from other interrupts:
    8: General Fault (Memory Faults, Instruction Faults, FPU Traps)
    A: TLB Miss (TLB Miss, ACL Miss)
    C: Interrupt (1kHz HW timer mostly)
    E: Syscall (System Calls)

    Typically, the VBR layout looks like:
    + 0: Reset (typically only used on boot, with VBR reset to 0)
    + 8: General Fault
    +16: TLB Miss
    +24: Interrupt
    +32: Syscall
    With a non-standard alignment requirement (vector table needs to be
    aligned to a multiple of 256 bytes, for "reasons"). Though actual CPU
    core currently only needs a 64B alignment (256B would allow adding a lot
    more vectors while staying with the use of bit-slicing). Each "entry" in
    this table being a branch to the entry point of the ISR handler.

    On initial Boot, as a little bit of a hack, the CPU looks at the
    encoding of the Reset Vector branch to determine the initial ISA Mode
    (such as XG1, XG3, or RV64GC).



If doing a TLB miss handler in ASM, possible strategy could be:
  Fast path:
    Save off some of the registers;
    Check if a simple case TLB miss or ACL miss;
    Try to deal with it;
    Restore registers;
    Return.
  Fallback:
    Save rest of registers;
    Deal with more complex scenario (probably in C land);
      Such as initiate a context switch to the page-fault handler.

    For the simple cases:
    TLB Miss involves walking the page table;
    ACL miss may involve first looking up the ID pairs in a hash table;
    Fallback cases may involve more complex logic in a more general handler.
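As a rough C sketch of the fast path above (not the actual TestKern handler; the names, the 3-level/16K layout giving 47 bits, and the assumption that the page-table pages are directly addressable from the handler's context are all illustrative):

#include <stdint.h>

#define PAGE_SHIFT 14                     /* 16K pages                      */
#define IDX_BITS   11                     /* 16K / 8B PTEs = 2048 per level */
#define PTE_VALID  1u

extern uint64_t  tlbmiss_badva(void);     /* faulting VA (from a CR)        */
extern uint64_t *page_table_root(void);   /* per-process root               */
extern void      tlb_load_entry(uint64_t va, uint64_t pte);
extern void      raise_page_fault(uint64_t va);   /* complex/fallback path  */

/* In the real handler this body would be bracketed by the minimal ASM
   register save/restore described above; only the fallback saves the
   full register file. */
void tlb_miss(void)
{
    uint64_t  va  = tlbmiss_badva();
    uint64_t *tbl = page_table_root();
    for (int level = 2; level >= 0; level--) {        /* 3-level walk       */
        uint64_t idx = (va >> (PAGE_SHIFT + level * IDX_BITS))
                       & ((1u << IDX_BITS) - 1);
        uint64_t pte = tbl[idx];
        if (!(pte & PTE_VALID)) {                     /* not simple: punt   */
            raise_page_fault(va);
            return;
        }
        if (level == 0) {                             /* leaf: load the TLB */
            tlb_load_entry(va, pte);
            return;
        }
        tbl = (uint64_t *)(uintptr_t)((pte >> PAGE_SHIFT) << PAGE_SHIFT);
    }
}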



At present, the Interrupt and Syscall handlers have the quirk that
they require TBR to be set up first, as they directly save to the
    register save area (relative to) TBR, rather than using the interrupt
    stack. The main rationale here being that these interrupts frequently
    perform context switches and saving/restoring registers to TBR greatly
    reduces the performance cost of performing a context switch.

    Note though that one ideally wants to use shared address spaces or ASIDs
    to limit the amount of TLB misses.

    Can note that currently my CPU core uses 16-bit ASIDs, split into 6+10
    bits, currently 64 groups, each with 1024 members. Global pages are
generally only global within a group, and high-numbered groups are
    assumed to not allow global pages. Say, for example, if you were running
    a VM, you wouldn't want its VAS being polluted with global pages from
    the host OS.

    Though, global pages would allow things like DLLs and similar to be
    shared without needing TLB misses for them on context switches.
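A small illustration of that split (field and function names invented here, not taken from the actual core):

#include <stdint.h>
#include <stdbool.h>

#define ASID_MEMBER_BITS 10               /* 6+10 bits: 64 groups x 1024 members */

static inline uint16_t asid_make(unsigned group, unsigned member)
{
    return (uint16_t)((group << ASID_MEMBER_BITS) | (member & 0x3FF));
}

static inline unsigned asid_group(uint16_t asid)
{
    return asid >> ASID_MEMBER_BITS;
}

/* A TLB entry matches if the ASID is identical, or the entry is marked
   global and both ASIDs fall in the same group (high-numbered groups,
   e.g. those used for VM guests, would additionally opt out of global
   matching). */
static inline bool tlb_asid_match(uint16_t entry_asid, bool entry_global,
                                  uint16_t cur_asid)
{
    if (entry_asid == cur_asid)
        return true;
    return entry_global && asid_group(entry_asid) == asid_group(cur_asid);
}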

    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Jakob Bohm@egenagwemdimtapsar@jbohm.dk to comp.arch,comp.lang.c on Sun Aug 17 20:18:36 2025
    From Newsgroup: comp.arch

    On 2025-08-05 23:08, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

What would be an example of a more conservative way to name the
    identifier?


    What is actually going on is GCC offering its users a gradual way to transition from C17 to C23, by applying the C23 meaning of any C23
    construct that has no conflicting meaning in C17 . In particular, this
    allows installed library headers to use the new types as part of
    logically opaque (but compiler visible) implementation details, even
    when those libraries are used by pure C17 programs. For example, the
    ISO POSIX datatype struct stat could contain a _BitInt(128) type for
    st_dev or st_ino if the kernel needs that, as was the case with the 1996
    NT kernel . Or a _BitInt(512) for st_uid as used by that same kernel .
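A minimal illustration of the idea (the struct and field names are invented, not the actual POSIX definitions; whether a given GCC release accepts _BitInt outside -std=c23, and on which targets, is version-dependent):

#include <stdio.h>

struct file_id {                  /* stand-in for something like struct stat */
    _BitInt(128) dev_id;          /* 128-bit device identifier               */
    unsigned _BitInt(64) ino;
};

int main(void)
{
    struct file_id f = { .dev_id = 1, .ino = 42 };
    f.dev_id <<= 100;                       /* plenty of headroom in 128 bits */
    printf("sizeof dev_id = %zu\n", sizeof f.dev_id);
    return 0;
}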

    GCC --pedantic is an option to check if a program is a fully conforming portable C program, with the obvious exception of the contents of any
    used "system" headers (including installed libc headers), as those are
    allowed to implement standard or non-standard features in implementation specific ways, and might even include implementation specific logic to
    report the use of non-standard extensions to the library standards when
    the compiler is invoked with --pedantic and no contrary options .

    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C instead
    of GNUC reverts those to the standard definition .

    Enjoy

    Jakob
    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 19:10:21 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 15:08:14 2025
    From Newsgroup: comp.arch

    On 8/17/2025 2:10 PM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.


    If the processor has microcode, could try to handle it that way.

    If it could work, and the CPU allows sufficiently complex logic in
    microcode to deal with this.

    ...



One idea I had considered early on was that there would be a
special interrupt class that always goes into the ROM; so to the OS it
would always look as if there were a HW page walker.

This was eventually dropped though, as I was typically using 32K for the
    Boot ROM, and with the initial startup tests, font initialization, and
    FAT32 driver + PEL and elf loaders, ..., there wasn't much space left
    for "niceties" like TLB miss handling and similar. So, the role of the
    ROM was largely reduced to initial boot-up.

    It could be possible to have a "2-stage ROM", where the first stage boot
    ROM also loads more "ROM" from the SDcard. But, at that point, may as
    well just go over to using the current loader design to essentially try
    to load a UEFI BIOS or similar (which could then load the OS, achieving basically the same effect).

    Where, in effect, UEFI is basically an OS in its own right, which just
    so happens to use similar binary formats to what I am using already (eg, PE/COFF).

    Not yet gone up the learning curve for how to make TestKern behave like
    a UEFI backend though (say, for example, if I wanted to try to get
    "Debian RV64G" or similar to boot on my stuff).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 18:56:49 2025
    From Newsgroup: comp.arch

    On 8/17/2025 12:35 PM, EricP wrote:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns.  Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
It's programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like
    74LS375.

    For a wide instruction or stage register I'd look at chips such as a
    74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable,
    vcc, gnd.

So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an eight-chip
inverter (a bit slower, but inverters should be fast).

Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more.  If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
a wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

Agreed, the logic has to go somewhere.  Regularity in the
instruction set would have been even more important than now
to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

But a fixed 32-bit instruction is very much easier to fetch and
decode, and needs a lot less logic for shifting prefetch buffers,
compared to, say, variable length 1 to 12 bytes.


When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.


    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)


    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

Reg3B was a bit hacky, but had similar hit rates while using less encoding
space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Sun Aug 17 22:18:28 2025
    From Newsgroup: comp.arch

    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 18 05:48:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [...]

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    I had actually forgotten that the VAX also had decimal
    instructions. But the 11/780 also had one really important
    restriction: It could only do one write every six cycles, see https://dl.acm.org/doi/pdf/10.1145/800015.808199 , so that
    severely limited their throughput there (assuming they did
    things bytewise). So yes, decimal arithmetic was important
    in the day for COBOL and related commercial applications.

    So, what to do with decimal arithmetic, which was important
    at the time (and a business consideration)?

    Something like Power's addg6s instruction could have been
    introduced, it adds two numbers together, generating only the
    decimal carries, and puts a nibble "6" into the corresponding
    nibble if there is one, and "0" otherwise. With 32 bits, that
    would allow addition of eight-digit decimal numbers in four
    instructions (see one of the POWER ISA documents for details),
    but the cycle of "read ASCII digits, do arithmetic, write
    ASCII digits" would have needed some extra shifts and masks,
    so it might have been more beneficial to use four digits per
    register.
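For illustration, here is the generic software version of that nibble-carry idea (the classic branch-free packed-BCD add, not the exact addg6s semantics): every digit is pre-biased by 6, and the bias is taken back out of the digits that did not generate a decimal carry.

#include <stdint.h>
#include <stdio.h>

/* Add two packed-BCD numbers of up to 7 digits held in 32-bit registers. */
static uint32_t bcd_add(uint32_t a, uint32_t b)
{
    uint32_t t1 = a + 0x06666666u;        /* pre-bias every digit by 6     */
    uint32_t t2 = t1 ^ b;                 /* sum without carries           */
    uint32_t t3 = t1 + b;                 /* sum with carries              */
    uint32_t t4 = t2 ^ t3;                /* where carries crossed nibbles */
    uint32_t t5 = ~t4 & 0x11111110u;      /* nibbles with no carry-out     */
    uint32_t t6 = (t5 >> 2) | (t5 >> 3);  /* a 6 in each such nibble       */
    return t3 - t6;                       /* remove the unneeded bias      */
}

int main(void)
{
    printf("%07x\n", bcd_add(0x0199999u, 0x0000001u));   /* prints 0200000 */
    printf("%07x\n", bcd_add(0x0123456u, 0x0654321u));   /* prints 0777777 */
    return 0;
}

The "read ASCII digits, do arithmetic, write ASCII digits" cycle mentioned above would add the pack/unpack shifts and masks around something like this.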

    The article above is also extremely interesting otherwise. It does
    not give cycle timings for each individual instruction and address
    mode, but it gives statistics on how they were used, and a good
    explanation of the timing implications of their microcode design.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.arch,comp.lang.c on Mon Aug 18 08:02:30 2025
    From Newsgroup: comp.arch

    On 18/08/2025 06:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    $ cat so.c
    #include <stdio.h>

    int main(void)
    {
    int foo = 42;
    size_t soa = sizeof (foo, 'C');
    size_t sob = sizeof foo;
    printf("%s.\n", (soa == sob) ? "Yes" : "No");
    return 0;
    }
    $ gcc -o so so.c
    $ ./so
    Yes.
    $ gcc --version
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch,comp.lang.c on Mon Aug 18 11:34:49 2025
    From Newsgroup: comp.arch

    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?


    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Aug 18 11:03:15 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.

    Yeah, this approach works a lot better than people seem to give it
    credit for...
    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.
    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    None of those research papers that I have seen consider the possibility
    that OoO can make use of multiple concurrent HW walkers if the
    cache supports hit-under-miss and multiple pending miss buffers.

    While instruction fetch only needs to occasionally translate a VA one
    at a time, with more aggressive alternate path prefetching all those VA
    have to be translated first before the buffers can be prefetched.
    LSQ could also potentially be translating as many VA as there are entries.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton

    I'm looking for papers that separate out the common cost of loading a PTE
    from the extra cost of just the SW-miss handler. I had a paper a while
    back but can't find it now. IIRC in that paper the extra cost of the
    SW miss handler on Alpha was measured at 5-25%.

One thing to mention about some of these papers looking at TLB performance: some papers on virtual address translation appear NOT to be aware that Intel's HW walker, on its downward walk, caches the interior-node
PTE's in auxiliary TLB's and checks for PTE TLB hits in bottom-to-top order (called a bottom-up walk), thereby avoiding many HW walks from the root.

    A SW walker can accomplish the same bottom-up walk by locating
    the different page table levels at *virtual* base addresses,
    and adding each VA of those interior PTE's to the TLB.
    This is what VAX VA translate did, probably Alpha too but I didn't check.
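
    [Editorial aside: a rough C sketch of such a bottom-up SW handler,
    assuming a hypothetical virtually-linear page table mapped at PTE_VBASE;
    all helper names below are made up for illustration. The common case is
    a single load of the leaf PTE through its virtual address; only a nested
    miss on that load falls back to a top-down walk, which as a side effect
    also puts the interior PTE into the TLB.

    typedef unsigned long u64;

    #define PAGE_SHIFT 12
    #define PTE_VBASE  0xFFFF800000000000UL   /* hypothetical self-map base */

    extern int  tlb_covers(u64 va);           /* hypothetical TLB probe */
    extern void tlb_fill(u64 va, u64 pte);    /* hypothetical TLB write */
    extern u64  walk_from_root(u64 va);       /* slow top-down walk */

    void sw_tlb_miss(u64 faulting_va)
    {
        /* the leaf PTE lives at a computable virtual address */
        u64 pte_va = PTE_VBASE + (faulting_va >> PAGE_SHIFT) * sizeof(u64);

        if (!tlb_covers(pte_va))
            /* nested miss: map the page-table page itself first */
            tlb_fill(pte_va, walk_from_root(pte_va));

        tlb_fill(faulting_va, *(u64 *)pte_va);
    }
    ]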

    This interior PTE node caching is critical for optimal performance
    and some of their stats don't take it into account
    and give much worse numbers than they should.

    Also many papers were written before ASID's were in common use
    so the TLB got invalidated with each address space switch.
    This would penalize any OS which had separate user and kernel space.

    So all these numbers need to be taken with a grain of salt.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 18 15:35:36 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.
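
    [Editorial aside: as an illustration of what that flexibility bought,
    here is a hedged C sketch of a hashed inverted page table lookup; all
    names and the field layout are hypothetical. It is the kind of structure
    a SW miss handler can traverse but a fixed hierarchical HW walker cannot.

    typedef unsigned long u64;

    struct ipt_entry {      /* hypothetical layout */
        u64  vpn;           /* virtual page number, tagged with the ASID */
        u64  pfn;           /* physical frame number plus permission bits */
        long next;          /* collision-chain index, -1 terminates */
    };

    extern struct ipt_entry ipt[];     /* roughly one entry per physical frame */
    extern long ipt_hash(u64 vpn);     /* hypothetical hash of the tagged VPN */
    extern void tlb_fill(u64 va, u64 pfn);
    extern void page_fault(u64 va);

    void sw_tlb_miss_ipt(u64 va, u64 asid)
    {
        u64 vpn = (va >> 12) | (asid << 52);         /* assumed tag packing */
        for (long i = ipt_hash(vpn); i >= 0; i = ipt[i].next)
            if (ipt[i].vpn == vpn) {
                tlb_fill(va, ipt[i].pfn);
                return;
            }
        page_fault(va);                              /* no mapping found */
    }
    ]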

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal
    reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 18 17:19:13 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0, they were
    independent hardware features added to V8.1.
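
    [Editorial aside: a minimal C11 sketch of the race in question, with
    hypothetical bit positions: the walker sets Accessed/Dirty with an
    atomic OR, and a page-aging or writeback scan clears them with an atomic
    AND, so neither side can lose the other's update between a plain load
    and store.

    #include <stdatomic.h>
    #include <stdint.h>

    #define PTE_ACCESSED  (UINT64_C(1) << 5)   /* hypothetical bit positions */
    #define PTE_DIRTY     (UINT64_C(1) << 6)

    /* walker side: set A (and D on a write) with one atomic RMW */
    static inline void pte_mark(_Atomic uint64_t *pte, int is_write)
    {
        uint64_t bits = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);
        atomic_fetch_or_explicit(pte, bits, memory_order_relaxed);
    }

    /* software side: e.g. a page-aging scan clears A and keeps the old value */
    static inline uint64_t pte_clear_accessed(_Atomic uint64_t *pte)
    {
        return atomic_fetch_and_explicit(pte, ~(uint64_t)PTE_ACCESSED,
                                         memory_order_relaxed);
    }
    ]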
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Mon Aug 18 21:57:59 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .
    I'm not sure what you're referring to. You didn't say what foo is.
I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed
    character constant is int.
    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".

    Yes (more of a thinko, actually).

    I meant to ask about `sizeof (foo, 'C')` yielding a value *other than*
    `sizeof (int)`. Jakob implies a difference in this area between GNU C
    and ISO C. I'm not aware of any.
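
    [Editorial aside: for the record, a minimal test; any type for foo works,
    since only the right operand of the comma determines the type. I would
    expect this to print the same value twice with gcc in both gnu89 and
    c89/pedantic modes.

    #include <stdio.h>

    int main(void)
    {
        char foo = 'x';
        /* the comma expression has the type of its right operand, and an
           unprefixed character constant has type int in C */
        printf("%lu %lu\n", (unsigned long) sizeof (foo, 'C'),
                            (unsigned long) sizeof (int));
        return 0;
    }
    ]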
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 05:47:01 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    However, even without the loop overhead (which can be reduced with
    unrolling) that's 8 instructions per iteration, and therefore we will
    have a hard time executing it at less than 1cycle/iteration on current
    CPUs. What if we mix in some adc-based stuff to bring down the
    instruction count? E.g., with one adc-based and one cmov-based
    iteration:

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    add [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    Now we have 15 instructions per unrolled iteration (3 original
    iterations). Executing an unrolled iteration in less than three
    cycles might be in reach for Zen3 and Raptor Cove (I don't remember if
    all the other resource limits are also satisfied; the load/store unit
    may be at its limit, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 07:09:51 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

So we would need something other than "adc edx, edx" to set the carry register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one
    iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Aug 19 12:11:56 2025
    From Newsgroup: comp.arch

    Anton, I like what you and Michael have done, but I'm still not sure everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    Re breaking dependency chains (from Michael):

    In each iteration we have four inputs:

    carry_in from the previous iteration, [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8], and we want to generate [rdi+rcx*8] and the carry_out.

    Assuming effectively random inputs, cin+[rsi]+[r8]+[r9] will result in
    random low-order 64 bits in [rdi], and either 0, 1 or 2 as carry_out.

    In order to break the per-iteration dependency (per Michael), it is
    sufficient to branch out IFF adding cin to the 3-sum produces an
    additional carry:

    ; rdx = cin (0,1,2)
    next:
    mov rbx,rdx ; Save CIN
    xor rdx,rdx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc rdx,rdx ; RDX = 0 or 1 (50:50)
    add rax,[r9+rcx*8]
    adc rdx,0 ; RDX = 0, 1 or 2 (33:33:33)

    ; At this point RAX has the 3-sum, now do the cin 0..2 add

    add rax,rbx
    jc fixup ; Pretty much never taken

    save:
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next

    fixup:
    inc rdx
    jmp save

    It would also be possible to use SETC to save the intermediate carries...
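
    For reference, a hedged C rendering of the same scheme (my sketch, not
    Terje's or Michael's code): the per-iteration carry-out is 0, 1 or 2, and
    adding the previous carry-in to the already-formed 3-sum overflows so
    rarely for random data that the extra increment can live on a cold path,
    which is exactly what breaks the loop-carried flag dependency.

    #include <stdint.h>
    #include <stddef.h>

    void add3_sketch(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                     const uint64_t *c, size_t n)
    {
        uint64_t cin = 0;                      /* 0, 1 or 2 */
        for (size_t i = 0; i < n; i++) {
            uint64_t cout = 0;
            uint64_t s = a[i] + b[i];
            cout += (s < a[i]);                /* first carry */
            uint64_t t = s + c[i];
            cout += (t < s);                   /* second carry, cout in 0..2 */
            uint64_t r = t + cin;
            if (r < t)                         /* rarely taken on random data */
                cout++;
            dst[i] = r;
            cin = cout;
        }
        /* final cin is the carry out of the whole 3-way addition */
    }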

    Terje

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely.For random inputs-approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    - anton

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 17:20:54 2025
    From Newsgroup: comp.arch

    On Tue, 19 Aug 2025 07:09:51 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready
    and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
    all makes more sense with that. ebx then contains the carry from
    the last cycle on entry. The carry dependency chain starts at
    clearing edx, then gets to additional carries, then is copied to
    ebx, transferred into the next iteration, and is ended there by
overwriting ebx. No dependency cycles (except the loop counter and addresses, which can be dealt with by hardware or by unrolling), and
    ebx contains the carry from the last iteration

    One other problem is that according to Agner Fog's instruction
tables, even the latest and greatest CPUs from AMD and Intel that he measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
    per cycle, and adc has a latency of 1, so breaking the dependency
chain in a beneficial way should avoid the use of adc. For our three-summand add, it's not clear if adcx and adox can run in the
    same cycle, but looking at your measurements, it is unlikely.

So we would need something other than "adc edx, edx" to set the carry register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

Too many back and forth mental switches between Intel and AT&T syntax.
    The real code that I measured was for Windows platform, but in AT&T
    (gnu) syntax.
Below is the full function with the loop unrolled by 3. The rest maybe
I'd answer later; right now I don't have time.

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
.def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %r13
    .seh_pushreg %r13
    pushq %r12
    .seh_pushreg %r12
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %rbp
    mov 16(%rcx,%rdx), %r10
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %rbp
    adc 16(%rcx,%r8), %r10
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %rbp
    adc 16(%rcx,%r9), %r10
    adc $0, %esi
add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %rbp, 8(%rcx)
    mov %r10, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r12
    popq %r13
    ret

    .prop_carry:
    add $1, %rbp
    adc $0, %r10
    adc $0, %esi
    jmp .carry_done

    .seh_endproc








    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never
incremen_edx:
    inc edx
    jmp edx_ready

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 17:24:23 2025
    From Newsgroup: comp.arch

    Above by mistake I posted not the most up to date variant, sorry.
    Here is a correct code:

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
.def add3; .scl 2; .type 32; .endef
.seh_proc add3
    add3:
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %r10
    mov 16(%rcx,%rdx), %r11
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %r10
    adc 16(%rcx,%r8), %r11
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %r10
    adc 16(%rcx,%r9), %r11
    adc $0, %esi
add %rax, %rdi # add carry from the previous iteration
jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %r10, 8(%rcx)
    mov %r11, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    ret

    .prop_carry:
    add $1, %r10
    adc $0, %r11
    adc $0, %esi
    jmp .carry_done

    .seh_endproc


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 17:43:25 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton, I like what you and Michael have done, but I'm still not sure everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    I implemented a two-summand addition, not three-summand. I wanted the
minimum of complexity to make it easier to understand, and latency is
    a bigger problem for the two-summand case.

    It would also be possible to use SETC to save the intermediate carries...

    I must have had a bad morning. Instead of xor edx, edx, setc dl (also
    2 per cycle on Zen3), I wrote

    mov edi,1
    ...
    xor edx, edx
    ...
    cmovc edx, edi

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 23:03:01 2025
    From Newsgroup: comp.arch

    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

Below are execution times of very heavily unrolled adcx/adox code with the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

For Gracemont and Skylake there exists a possibility of a small
measurement mistake, but Raptor Cove appears to be capable of at least 2
instructions of this type per clock, while Zen3 is capable of at least 1.5
but more likely also 2.
It looks to me that the bottleneck on both RC and Z3 is either the rename
phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
sub %rdx, %r9 # r9 = c - a
mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    mov 16(%r11), %rax
    adcx 16(%r11,%r8), %rax
    adox 16(%r11,%r9), %rax
    mov %rax, 16(%r10,%r11)

    mov 24(%r11), %rax
    adcx 24(%r11,%r8), %rax
    adox 24(%r11,%r9), %rax
    mov %rax, 24(%r10,%r11)

    mov 32(%r11), %rax
    adcx 32(%r11,%r8), %rax
    adox 32(%r11,%r9), %rax
    mov %rax, 32(%r10,%r11)

    mov 40(%r11), %rax
    adcx 40(%r11,%r8), %rax
    adox 40(%r11,%r9), %rax
    mov %rax, 40(%r10,%r11)

    mov 48(%r11), %rax
    adcx 48(%r11,%r8), %rax
    adox 48(%r11,%r9), %rax
    mov %rax, 48(%r10,%r11)

    mov 56(%r11), %rax
    adcx 56(%r11,%r8), %rax
    adox 56(%r11,%r9), %rax
    mov %rax, 56(%r10,%r11)

    mov 64(%r11), %rax
    adcx 64(%r11,%r8), %rax
    adox 64(%r11,%r9), %rax
    mov %rax, 64(%r10,%r11)

    mov 72(%r11), %rax
    adcx 72(%r11,%r8), %rax
    adox 72(%r11,%r9), %rax
    mov %rax, 72(%r10,%r11)

    mov 80(%r11), %rax
    adcx 80(%r11,%r8), %rax
    adox 80(%r11,%r9), %rax
    mov %rax, 80(%r10,%r11)

    mov 88(%r11), %rax
    adcx 88(%r11,%r8), %rax
    adox 88(%r11,%r9), %rax
    mov %rax, 88(%r10,%r11)

    mov 96(%r11), %rax
    adcx 96(%r11,%r8), %rax
    adox 96(%r11,%r9), %rax
    mov %rax, 96(%r10,%r11)

    mov 104(%r11), %rax
    adcx 104(%r11,%r8), %rax
    adox 104(%r11,%r9), %rax
    mov %rax, 104(%r10,%r11)

    mov 112(%r11), %rax
    adcx 112(%r11,%r8), %rax
    adox 112(%r11,%r9), %rax
    mov %rax, 112(%r10,%r11)

    mov 120(%r11), %rax
    adcx 120(%r11,%r8), %rax
    adox 120(%r11,%r9), %rax
    mov %rax, 120(%r10,%r11)

    lea 136(%r11), %r11

    mov -8(%r11), %rax
    adcx -8(%r11,%r8), %rax
    adox -8(%r11,%r9), %rax
    mov %rax, -8(%r10,%r11)

    mov %ebx, %eax # EAX <= 0
adcx %ebx, %eax # EAX <= CF, CF <= 0
adox %ebx, %eax # EAX += OF, OF <= 0

    add %rcx, %rsi
    jc .prop_carry
    .carry_done:
    mov %rsi, -136(%r10,%r11)
    mov %eax, %ecx
    dec %edx
    jnz .loop

    # last 3
    mov (%r11), %rax
    mov 8(%r11), %rdx
    mov 16(%r11), %rbx
    add (%r11,%r8), %rax
    adc 8(%r11,%r8), %rdx
    adc 16(%r11,%r8), %rbx
    add (%r11,%r9), %rax
    adc 8(%r11,%r9), %rdx
    adc 16(%r11,%r9), %rbx
    add %rcx, %rax
    adc $0, %rdx
    adc $0, %rbx
    mov %rax, (%r10,%r11)
    mov %rdx, 8(%r10,%r11)
    mov %rbx, 16(%r10,%r11)

    lea (-1020*8)(%r10,%r11), %rax
    popq %rbx
    popq %rsi
    ret

    .prop_carry:
    lea -128(%r10,%r11), %rbx
    xor %ecx, %ecx
    addq $1, (%rbx)
    adc %rcx, 8(%rbx)
    adc %rcx, 16(%rbx)
    adc %rcx, 24(%rbx)
    adc %rcx, 32(%rbx)
    adc %rcx, 40(%rbx)
    adc %rcx, 48(%rbx)
    adc %rcx, 56(%rbx)
    adc %rcx, 64(%rbx)
    adc %rcx, 72(%rbx)
    adc %rcx, 80(%rbx)
    adc %rcx, 88(%rbx)
    adc %rcx, 96(%rbx)
    adc %rcx,104(%rbx)
    adc %rcx,112(%rbx)
    adc %rcx,120(%rbx)
    adc %ecx, %eax
    jmp .carry_done
    .seh_endproc








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 01:49:41 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

Yes. But if the argument had any merit that 512B is a good page size because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I think we're agreeing that even in the early 1980s a 512 byte page was
    too small. They certainly couldn't have made it any smaller, but they
    should have made it larger.

S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.

    Several posts above I wrote:

    : I think that in 1979 VAX 512 bytes page was close to optimal.
    : Namely, IIUC smallest supported configuration was 128 KB RAM.
    : That gives 256 pages, enough for sophisticated system with
    : fine-grained access control.

Note that the 360 had optional page protection used only for access
control. In the 370 era they had a legacy of 2K or 4K pages, and
AFAICS IBM was mainly aiming at bigger machines, so they
were not so worried about fragmentation. PDP-11 experience
possibly contributed to using smaller pages for the VAX.

    Microprocessors were designed with different constraints, which
led to bigger pages. But the VAX apparently could afford a reasonably
large TLB, and due to the VMS structure the gain was bigger than for
other OSes.

And a little correction: the VAX architecture handbook is dated 1977,
so the decision about page size actually had to be made by 1977,
and possibly earlier.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Aug 20 02:49:26 2025
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
    ...

    Note that 360 has optional page protection used only for access
    control. In 370 era they had legacy of 2k or 4k pages, and
    AFAICS IBM was mainly aiming at bigger machines, so they
    were not so worried about fragmentation.

    I don't think so. The smallest 370s were 370/115 with 64K to 192K of
    RAM, 370/125 with 96K to 256K, both with paging hardware and running
    DOS/VS. The 115 was shipped in 1973, the 125 in 1972.

    PDP-11 experience possibly contributed to using smaller pages for VAX.

The PDP-11's pages were 8K, which was too big to be useful as pages, so
we used them as a single block for swapping. When I was at Yale I did
a hack that mapped the 32K display memory for a bitmap terminal into
the high half of the process' data space, but that left too little room
for regular data, so we addressed the display memory a different way that
didn't use up address space.

    Microprocessors were designed with different constraints, which
    led to bigger pages. But VAX apparently could afford resonably
    large TLB and due VMS structure gain was bigger than for other
    OS-es.

    I can only guess what their thinking was, but I can tell you that
    at the time the 512 byte pages seemed oddly small.

    And little correction: VAX architecture handbook is dated 1977,
    so actually decision about page size had to be made at least
    in 1977 and possibly earlier.

    The VAX design started in 1976, well after IBM had shipped those
    low end 370s with tiny memories and 2K pages.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 03:47:17 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs MHz CPI Machine
    1 5 10 11/780
    4 12.5 6.25 8600
    6 22.2 7.4 8700
    35 90.9 5.1 NVAX+

    SPEC92 MHz VAX CPI Machine
    1/1 5 10/10 VAX 11/780
    133/200 200 3/2 Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instructions) of the VAX 11/780 are annecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.

    Note that the NVAX+ was made in the same process as the 21064, the
21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

The PRISM paper says the following about RISC versus VAX performance:

    : 1. Shorter cycle time. VAX chips have more, and longer, critical
    : paths than RISC chips. The worst VAX paths are the control store
    : loop and the variable length instruction decode loop, both of
    : which are absent in RISC chips.

    : 2. Fewer cycles per function. Although VAX chips require fewer
    : instructions than RISC chips (1:2.3) to implement a given
    : function, VAX instructions take so many more cycles than RISC
    : instructions (5-10:1-1.5) that VAX chips require many more cycles
    : per function than RISC chips.

    : 3. Increased pipelining. VAX chips have more inter- and
    : intra-instruction dependencies, architectural irregularities,
    : instruction formats, address modes, and ordering requirements
    : than RISC chips. This makes VAX chips harder and more
    : complicated to pipeline.

    Point 1 above for me means that VAX chips were microcoded. Point
2 above suggests that there were limited changes compared to VAX-780
    microcode.

IIUC attempts to create better hardware for VAX were canceled
just after the PRISM memos, so later VAXes used essentially the same
logic, just rescaled to a better process.

I think that the VAX had a problem with hardware decoders because of gate
delay: in 1987 a hardware decoder would probably have slowed down the clock.
But the 1977 design looks quite relaxed to me: the main logic was Schottky
TTL, which nominally has 3 ns of inverter delay. With a 200 ns cycle
this means about 66 gate delays per cycle. And in critical paths
the VAX used ECL. I do not know exactly which ECL, but AFAIK 2 ns ECL was
commonly available in 1970 and 1 ns ECL was leading edge in 1970.

That is why I think that in 1977 a hardware decoder could give a
speedup, assuming that the execution units could keep up: the gate delay
and cycle time mean that a rather deep circuit could fit within the
cycle time. IIUC 1987 designs were much more aggressive and the
decoder delay probably could not fit within a single cycle.

It is quite possible that hardware designers attempting VAX hardware
decoders were too ambitious and wanted to decode overly complicated
instructions in one cycle. AFAICS, for instructions that cannot
be executed in one cycle, decode can be slower than one
cycle; all one needs is to recognize within one cycle
that decode will take multiple cycles.

Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware the
things that later RISC designs put in hardware would almost surely
have exceeded the allowed cost. Technically, at 1 mln transistors one
should be able to do an acceptable RISC, and IIUC the IBM 360/90 used
about 1 mln transistors in a less dense technology, so in 1977 it was
possible to build a 1 mln transistor machine.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 20 10:50:39 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are Execution times of very heavily unrolled adcx/adox code with dependency broken by trick similiar to above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of small
    measurement mistake, but Raptor Cove appears to be capable of at least 2 instructions of this type per clock while Zen3 capable of at least 1.5
    but more likely also 2.
    It looks to me that the bottlenecks on both RC and Z3 are either rename
    phase or more likely L1$ access. It seems that while Golden/Raptore Cove
    can occasionally issue 3 load + 2 stores per clock, it can not sustain
    more than 3 load-or-store accesses per clock


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - c
    mov %rdx, %r11 # r11 - a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    [snipped the rest]


    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain two
    carry bits without having to save them off to an additional register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 20 14:16:55 2025
    From Newsgroup: comp.arch

    On Wed, 20 Aug 2025 10:50:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:



    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain
    two carry bits without having to save them off to an additional
    register.

    Terje


    It is interesting as an exercise in ADX extension programming, but in
practice it is only 0-10% faster than the much simpler and smaller code
presented in the other post, which uses no ISA extensions and so runs on
every AMD64 CPU since K8.
I suspect that this result is quite representative of the gains that
can be achieved with ADX. Maybe, if there is a crypto requirement
    of independence of execution time from inputs then the gain would be
    somewhat bigger, but even there I would be very surprised to find 1.5x
    gain.
    Overall, I think that time spent by Intel engineers on invention of ADX
    could have been spent much better.


    Going back to the task of 3-way addition, another approach that can
    utilize the same idea of breaking data dependency is using SIMD.
In the case of the 4 cores that I tested, SIMD means AVX2.
These are the results of an AVX2 implementation that unrolls by two,
i.e. 512 output bits per iteration of the inner loop.

    Platform RC GM SK Z3
    add3_avxq_u2 226.7 823.3 321.1 309.5

The speed is about equal to the more unrolled ADX variant on RC, faster on
Z3, much faster on SK and much slower on GM. Unlike ADX, it runs on
Intel Haswell and on a few pre-Zen AMD CPUs.

    .file "add3_avxq_u2.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    subq $56, %rsp
    .seh_stackalloc 56
    vmovups %xmm6, 32(%rsp)
    .seh_savexmm %xmm6, 32
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx # %rdx - a-dst
    sub %rcx, %r8 # %r8 - b-dst
    sub %rcx, %r9 # %r9 - c-dst
    vpcmpeqq %ymm6, %ymm6, %ymm6
    vpsllq $63, %ymm6, %ymm6 # ymm6[0:3] = msbit = 2**63
    vpxor %xmm5, %xmm5, %xmm5 # ymm5[0] = carry = 0
    mov $127, %eax
    .loop:
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 32(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[4:7] = a[4:7] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 32(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[4:7] = iA[4:7] + b[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[4:7] = iA[4:7] > iSum1[4:7]
    vpaddq 32(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum1[4:7] + c[4:7]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[4:7] = iSum1[4:7] > iSum2[4:7]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum0[4:7] = c1[4:7] + c2[4:7]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0:2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm1[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0x93, %ymm3, %ymm5
    # ymm5[0:3] = cSum0[7,4:6] == carry
    vpblendd $3, %ymm4, %ymm5, %ymm3
    # ymm3[0:3] = cSum[4:7] = { cSum0[3], cSum0[4:6] }
    .add_carry:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm3[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vpor %ymm0, %ymm1, %ymm4
    vptest %ymm4, %ymm4
    jne .prop_carry
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %ymm1, 32(%rcx)
    addq $64, %rcx
    dec %eax
    jnz .loop

    # last 7
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 24(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[3:6] = a[3:6] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 24(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[3:6] = iA[3:6] + b[3:6]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[3:6] = iA[3:6] > iSum1[3:6]
    vpaddq 24(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[3:6] = iSum1[3:6] + c[3:6]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[3:6] = iSum1[3:6] > iSum2[3:6]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum[4:7] = cSum0[3:6] = c1[3:6] + c2[367]
    vpermq $0x93, %ymm2, %ymm4
    # ymm2[0:3] = cSum0[3,0,1,2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm1[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0xF9, %ymm1, %ymm1
    # ymm3[0:3] = iSum2[4:6,6]
    .add_carry2:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vptest %ymm0, %ymm0
    jne .prop_carry2
    vptest %ymm1, %ymm1
    jne .prop_carry2
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %xmm1, 32(%rcx)
    vextractf128 $1, %ymm1, %xmm1
    vmovq %xmm1, 48(%rcx)

    lea -(127*64)(%rcx), %rax
    vzeroupper
    vmovups 32(%rsp), %xmm6
    addq $56, %rsp
    ret

    .prop_carry:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vpaddq %xmm2, %xmm5, %xmm5
    # ymm5[0] = carry += c3[7]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry

    .prop_carry2:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry2

    .seh_endproc

    AVX2 is rather poorly suited for this task - it lacks unsigned
    comparison instructions, so the first input should be shifted by
    half-range at the beginning and the result should be shifted back.
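
    The biasing trick spelled out as a small intrinsics helper (an
    illustration only; the code above keeps one operand pre-biased across the
    whole loop instead of biasing per comparison): XORing both sides with
    2**63 turns the signed VPCMPGTQ into an unsigned greater-than.

    #include <immintrin.h>
    #include <stdint.h>

    /* unsigned 64-bit x > y per lane, using AVX2's signed compare */
    static inline __m256i cmpgt_epu64(__m256i x, __m256i y)
    {
        const __m256i bias = _mm256_set1_epi64x(INT64_MIN);   /* 2**63 */
        return _mm256_cmpgt_epi64(_mm256_xor_si256(x, bias),
                                  _mm256_xor_si256(y, bias));
    }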

    AVX-512 can be more suitable. But the only AVX-512 capable CPU that I
have access to is a mini PC with a cheap and slow Core i3, used by family
members almost exclusively for viewing movies. It does not even have
a minimal programming environment installed.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 20 14:08:34 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Overall, I think that time spent by Intel engineers on invention of ADX
    could have been spent much better.

    The whitepapers about ADX are about long multiplication and squaring
    (for cryptographic uses), and they are from 2012, and ADX was first
    implemented in Broadwell (2014), when microarchitectures were quite a
    bit narrower than the recent ones.

    If you implement the classical long multiplication algorithm, but add
    each line to the intermediate sum as you create it, you need an
    operation like

    intermediateresult += multiplicand*multiplicator[i]

    where all parts are multi-precision numbers, but only one word of the multiplicator is used. The whole long multiplication would look
    somewhat like:

    intermediateresult=0;
    for (i=0; i<n; i++) {
    intermediateresult += multiplicand*multiplicator[i];
    shift intermediate result by one word; /* actually, you will access it at */
    /* an offset, but how to express this in this pseudocode? */ }

    The operation for a single line can be implemented as:

    carry=0;
    for (j=0; j<m; j++) {
      uint128_t d = intermediateresult[j] +
                    multiplicand[j]*(uint128_t)multiplicator[i] +
                    (uint128_t)carry;
      intermediateresult[j] = d; /* low word */
      carry = d >> 64;
    }
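
    Putting the two loops together, a self-contained sketch (a hedged
    illustration, not tuned code; assuming 64-bit limbs stored
    least-significant first, names as in the pseudocode above, and
    unsigned __int128 as the double-width type) could look like:

    #include <stddef.h>
    #include <stdint.h>

    /* result[0..m+n-1] = multiplicand[0..m-1] * multiplicator[0..n-1] */
    void longmul(uint64_t *result,
                 const uint64_t *multiplicand, size_t m,
                 const uint64_t *multiplicator, size_t n)
    {
      for (size_t k = 0; k < m + n; k++)
        result[k] = 0;
      for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < m; j++) {
          unsigned __int128 d = result[i + j]
            + (unsigned __int128)multiplicand[j] * multiplicator[i]
            + carry;
          result[i + j] = (uint64_t)d;   /* low word */
          carry = (uint64_t)(d >> 64);
        }
        result[i + m] = carry;           /* the "shift by one word" offset */
      }
    }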

    The computation of d (both words) can be written on AMD64 as:

    #multiplicator[i] in rax
    mov multiplicator[i], rax
    mulq multiplicand[j]
    addq intermediateresult[j], rax
    adcq $0, rdx
    addq carry, rax
    adcq $0, rdx
    mov rdx, carry

    With ADX and BMI2, this can be coded as:

    #carry is represented as carry1+C+O
    mulx ma, m, carry2
    adcx mb, m
    adox carry1, m
    #carry is represented as carry2+C+O
    #unroll by an even factor, and switch roles for carry1 and carry2
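
    For what it's worth, the same step can be written with the BMI2/ADX
    intrinsics as provided by GCC/Clang in <immintrin.h> (compile with
    -mbmi2 -madx); whether the compiler really keeps the two carry chains
    on separate flags (adcx/adox) is up to it, so this is only a hedged
    illustration of the idea, not tuned code:

    #include <immintrin.h>

    /* One column: *ir_j += a_j*b_i + *hi_in, with the low half of the
       product going into this column and the high half into *hi_out for
       the next one; cf and of are the two bit-carry chains. */
    static inline void madd_column(unsigned long long *ir_j,
                                   unsigned long long a_j, unsigned long long b_i,
                                   unsigned long long *hi_in, unsigned long long *hi_out,
                                   unsigned char *cf, unsigned char *of)
    {
      unsigned long long m = _mulx_u64(a_j, b_i, hi_out); /* hi_out:m = a_j*b_i */
      *cf = _addcarryx_u64(*cf, m, *ir_j, &m);   /* chain 1: + intermediate result */
      *of = _addcarryx_u64(*of, m, *hi_in, &m);  /* chain 2: + previous high half */
      *ir_j = m;
    }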

    Does it matter? We can apply the usual blocking techniques to
    perform, say, a 4x4 submultiplication in the registers (probably even
    a little larger, but let's stick with these numbers). That's 16
    mulx/adox/adcx combinations, loads of 4+4 words of inputs and stores
    of 8 words of output. mulx is probably limited to one per cycle, but
    if we want to utilize this on a 4-wide machine like the Broadwell, we
    must have at most 3 additional instructions per mulx; with ADX, one
    additional instruction is adcx, another adox, and the third is either
    a load or a store. Any additional overhead, and the code will be
    limited by resources other than the multiplier.
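
    Spelling out that count (arithmetic added for illustration, not from the
    whitepaper): the 4x4 block needs 16 mulx + 16 adcx + 16 adox = 48
    instructions, plus 8 loads (4+4 input words) and 8 stores (8 output
    words), i.e. 64 instructions in total, or exactly 4 per mulx -- which
    just saturates a 4-wide machine at one mulx per cycle.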

    On today's CPUs, we can reach the 1 mul/cycle limit with the x86-64-v1
    code shown before the ADX code. But then, they might put a second
    multiplier in, and we would profit from ADX again.

    But Intel seems to have second thoughts on ADX itself. ADX has not
    been included in x86-64-v4, despite the fact that every CPU that
    supports the other extensions of x86-64-v4 also supports ADX. And the whitepapers have vanished from Intel's web pages. Some time ago I still
    found it on https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
    (i.e., Intel China), but now it's gone there, too. I can still find
    it on <https://raw.githubusercontent.com/wiki/intel/intel-ipsec-mb/doc/ia-large-integer-arithmetic-paper.pdf>

    There is another whitepaper on using ADX for squaring numbers, but I
    cannot find that. Looking around at what Erdinç Öztürk (aka Erdinc
    Ozturk) has also written, there's a patent "SIMD integer
    multiply-accumulate instruction for multi-precision arithmetic" from
    2016, so maybe Intel's thoughts are now into doing it with SIMD
    instead of with scalar instructions.

    Still, why deemphasize ADX? Do they want to drop it eventually? Why?
    They have to support separate renaming of C, O, and the other three
    because of instructions that go much farther back. The only way would
    be to provide alternatives to these instructions, and then deemphasize
    them over time, and eventually rename all flags together (and the old instructions may then perform slowly).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 14:36:43 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    Not quite.
    My idea was to have two HW threads HT1 and HT2 which are like x86 HW
    threads except when HT1 gets a TLB miss it stalls its execution and
    injects the TLB miss handler at the front of HT2 pipeline,
    and a HT2 TLB miss stalls itself and injects its handler into HT1.
    The TLB miss handler never itself TLB misses as it explicitly checks
    the TLB for any VA it needs to translate so recursion is not possible.

    As the handler is injected at the front of the pipeline no drain occurs.
    The only possible problem is if between when HT1 injects its miss handler
    into HT2 that HT2's existing pipeline code then also does a TLB miss.
    As this would cause a deadlock, if this occurs then the core detects it
    and both HTs fault and run their TLB miss handlers themselves.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were independent hardware features added in V8.1.

    Yes. A memory recycler can periodically clear the Accessed bit
    so it can detect page usage, and that might be a different core.
    But it might skip sending TLB shootdowns to all other cores
    to lower the overhead (maybe a lazy usage detector).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 16:41:39 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    Hmmm... I don't think that is possible, or if it is then it's really hairy.
    The miss handler needs to LD the memory PTE's, which can happen OoO.
    But it also needs to do things like writing control registers
    (e.g. the TLB) or setting the Accessed or Dirty bits on the in-memory PTE, things that usually only occur at retire. But those handler instructions
    can't get to retire because the older instructions that triggered the
    miss are stalled.

    The miss handler needs general registers so it needs to
    stash the current content someplace and it can't use memory.
    Then add a nested miss handler on top of that.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    As Scott said, to avoid race conditions with software clearing those bits.
    Plus there might be PTE modifications that an OS could perform on other
    PTE fields concurrently without first acquiring the normal mutexes
    and doing a TLB shoot down of the PTE on all the other cores,
    provided they are done atomically so the updates of one core
    don't clobber the changes of another.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    A HW walker looks simple to me.
    It has a few bits of state number and a couple of registers.
    It needs to detect memory read errors if they occur and abort.
    Otherwise it checks each TLB level in backwards order using the
    appropriate VA bits, and if it gets a hit walks back down the tree
    reading PTE's for each level and adding them to their level TLB,
    checking it is marked present, and performing an atomic OR to set
    the Accessed and Dirty flags if they are clear.

    The HW walker is even simpler if the atomic OR is implemented directly
    in the cache controller as part of the Atomic Fetch And OP series.
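
    As a concrete illustration of that atomic OR, a hedged C sketch with
    hypothetical PTE bit positions (not any particular architecture's layout):

    #include <stdatomic.h>
    #include <stdint.h>

    #define PTE_ACCESSED (1ull << 5)   /* hypothetical bit positions */
    #define PTE_DIRTY    (1ull << 6)

    /* Setting A (and D on a write) with a fetch-or is a one-way update,
       so concurrent walkers or stores on the same PTE cannot clobber
       each other's bits. */
    static void pte_mark(_Atomic uint64_t *pte, int is_write)
    {
      uint64_t bits = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);
      atomic_fetch_or_explicit(pte, bits, memory_order_relaxed);
    }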

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton

    Yes, and it seems to me that one would spend a lot more time trying to
    fix the SW walker than doing the simple HW walker that just works.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 19:17:01 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code/density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.

    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest
    conditional branch would be 3 bytes as it needs a register specifier.

    If one is doing variable byte length instructions then
    it allows the highest usage frequency to be most compact possible.
    Eg. an ADD with 32-bit immediate in 6 bytes.

    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.

    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 20 23:50:52 2025
    From Newsgroup: comp.arch

    On 8/20/2025 6:17 PM, EricP wrote:
    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions covers about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most
    instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable length 1 to 12 bytes.


    When code/density is the goal, a 16/32 RISC can do well.

    Can note:
      Maximizing code density often prefers fewer registers;
      For 16-bit instructions, 8 or 16 registers is good;
      8 is rather limiting;
      32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.


    Yeah.

    SuperH had:
    ZZZZnnnnmmmmZZZZ //2R
    ZZZZnnnniiiiiiii //2RI (Imm8)
    ZZZZnnnnZZZZZZZZ //1R


    For BJX2/XG1, had went with:
    ZZZZZZZZnnnnmmmm
    But, in retrospect, this layout was inferior to the one SuperH had used
    (and I almost would have just been better off doing a clean-up of the SH encoding scheme than moving the bits around).

    Though, this happened during a transition between B32V and BSR1, where:
    B32V was basically a bare-metal version of SH;
    BSR1 was an instruction repack (with tweaks to try make it more
    competitive with MSP430 while still remaining Load/Store);
    BJX2 was basically rebuilding all the stuff from BJX1 on top of BSR1's encoding scheme (which then mutated more).


    At first, BJX2's 32-bit ops were a prefix:
    111P-YYWY-qnmo-oooo ZZZZ-ZZZZ-nnnn-mmmm

    But, then got reorganized:
    111P-YYWY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ

    Originally, this repack was partly because I had ended up designing some Imm9/Disp9 encodings as it quickly became obvious that Imm5/Disp5 was insufficient. But, I had designed the new instructions to have the Imm
    field not be totally dog-chewed, so ended up changing the layout. Then
    ended up changing the encoding for the 3R instructions to better match
    that of the new Imm9 encodings.

    Then, XG2:
    NMOP-YYwY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ //3R

    Which moved entirely over to 32/64/96 bit encodings in exchange for
    being able to directly encode 64 GPRs in 32-bit encodings for the whole ISA.


    In the original BJX2 (later renamed XG1), only a small subset having
    direct access to the higher numbered registers; and other instructions
    using 64-bit encodings.

    Though, ironically, XG2 never surpassed XG1 in terms of code-density;
    but being able to use 64 registers "pretty much everywhere" was (mostly)
    a good thing for performance.


    For XG3, there was another repack:
    ZZZZ-oooooo-mmmmmm-ZZZZ-nnnnnn-qY-YYPw //3R

    But, this was partly to allow it to co-exist with RISC-V.

    Technically, still has conditional instructions, but these were demoted
    to optional; as if one did a primarily RISC-V core, with an XG3 subset
    as an ISA extension, they might not necessarily want to deal with the
    added architectural state of a 'T' bit.

    BGBCC doesn't currently use it by default.

    Was also able to figure out how to make the encoding less dog chewed
    than either XG2 or RISC-V.


    Though, ironically, the full merits of XG3 are only really visible in
    cases where XG1 and XG2 are dropped. But, it has a new boat-anchor in
    that it now assumes coexistence with RISC-V (which itself has a fair bit
    of dog chew).

    And, if the goal is RISC-V first, then likely the design of XG3 is a big
    ask; it being essentially its own ISA.

    Though, while giving fairly solid performance, XG3 currently hasn't
    matched the code density of its predecessors (either XG1 or XG2). It is
    more like "RISC-V but faster".

    And, needing to use mode changes to access XG3 or RV-C is a little ugly.



    Though, OTOH, RISC-V land is annoying in a way; lots of people being
    like "RV-V will save us from all our performance woes!". Vs, realizing
    that some issues need to be addressed in the integer ISA, and SIMD and auto-vectorization will not address inefficiencies in the integer ISA.


    Though, I have seen glimmers of hope that other people in RV land
    realize this...


    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest conditional branch would be 3 bytes as it needs a register specifier.


    Yeah, "BT/BF Disp8".


    If one is doing variable byte length instructions then
    it allows the highest usage frequency to be most compact possible.
    Eg. an ADD with 32-bit immediate in 6 bytes.



    In BSR1, I had experimented with:
    LDIZ Imm12u, R0 //R0=Imm12
    LDISH Imm8u //R0=(R0<<8)|Imm8u
    OP Imm4R, Rn //OP [(R0<<4)|Imm4u], Rn

    Which allowed Imm24 in 6 bytes or Imm32 in 8 bytes.
    Granted, as 3 or 4 instructions.
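
    To make the split explicit, a hedged C sketch of how a 32-bit constant
    would be decomposed across that LDIZ / LDISH / LDISH / OP sequence
    (field widths as in the comments above; everything else is illustrative):

    #include <stdint.h>

    uint32_t build_imm32(uint32_t imm)
    {
      uint32_t r0;
      r0 = (imm >> 20) & 0xFFF;               /* LDIZ  Imm12u, R0  (bits 31..20) */
      r0 = (r0 << 8) | ((imm >> 12) & 0xFF);  /* LDISH Imm8u       (bits 19..12) */
      r0 = (r0 << 8) | ((imm >> 4)  & 0xFF);  /* LDISH Imm8u       (bits 11..4)  */
      return (r0 << 4) | (imm & 0xF);         /* OP uses (R0<<4)|Imm4u (bits 3..0) */
    }

    Dropping the second LDISH gives the Imm24 case (12+8+4 bits in 3 instructions).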

    Though, this began the process of allowing the assembler to fake more
    complex instructions which would decompose into simpler instructions.


    But, this was not kept, and in BJX2 was mostly replaced with:
    LDIZ Imm24u, R0
    OP R0, Rn

    Then, when I added Jumbo Prefixes:
    OP Rm, Imm33s, Rn

    Some extensions of RISC-V support Imm32 in 48-bit ops, but this burns
    through lots of encoding space.

    iiiiiiii-iiiiiiii iiiiiiii-iiiiiiii zzzz-nnnnn-z0-11111

    This doesn't go very far.


    Can note ISAs with 16 bit encodings:
      PDP-11: 8 registers
      M68K  : 2x 8 (A and D)
      MSP430: 16
      Thumb : 8|16
      RV-C  : 8|32
      SuperH: 16
      XG1   : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.


    If the smallest instruction size is 16 bits, it simplifies things
    considerably vs 8 bits.

    If the smallest size is 32-bits, it simplifies things even more.
    Fixed length is the simplest case though.


    As noted, 32/64/96 bit fetch isn't too difficult though.

    For 64/96 bit instructions though, mostly want to be able to (mostly)
    treat it like a superscalar fetch of 2 or 3 32-bit instructions.

    In my CPU, I ended up making it so that only 32-bit instructions support superscalar; whereas 16 and 64/96 bit instructions are scalar only.

    Superscalar only works with native alignment though (for RISC-V), and
    for XG3, 32-bit instruction alignment is mandatory.


    As noted, in terms of code density, a few of the stronger options are
    Thumb2 and RV-C, which have 16 bits as the smallest size.


    I once experimented with having a range of 24-bit instructions, but the
    hair this added (combined with the fairly little gain in terms of code density) showed this was rather not worth it.


    ...


    In my recent fiddling for trying to design a pair encoding for XG3,
    can note the top-used instructions are mostly, it seems (non Ld/St):
      ADD   Rs, 0, Rd    //MOV     Rs, Rd
      ADD   X0, Imm, Rd  //MOV     Imm, Rd
      ADDW  Rs, 0, Rd    //EXTS.L  Rs, Rd
      ADDW  Rd, Imm, Rd  //ADDW    Imm, Rd
      ADD   Rd, Imm, Rd  //ADD     Imm, Rd

    Followed by:
      ADDWU Rs, 0, Rd    //EXTU.L  Rs, Rd
      ADDWU Rd, Imm, Rd  //ADDWu   Imm, Rd
      ADDW  Rd, Rs, Rd   //ADDW    Rs, Rd
      ADD   Rd, Rs, Rd   //ADD     Rs, Rd
      ADDWU Rd, Rs, Rd   //ADDWU   Rs, Rd

    Most every other ALU instruction and usage pattern either follows a
    bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
      SD  Rn, Disp(SP)
      LD  Rn, Disp(SP)
      LW  Rn, Disp(SP)
      SW  Rn, Disp(SP)

      LD  Rn, Disp(Rm)
      LW  Rn, Disp(Rm)
      SD  Rn, Disp(Rm)
      SW  Rn, Disp(Rm)


    For registers, there is a split:
      Leaf functions:
        R10..R17, R28..R31 dominate.
      Non-Leaf functions:
        R10, R18..R27, R8/R9

    For 3-bit configurations:
      R8..R15                             Reg3A
      R18/R19, R20/R21, R26/R27, R10/R11  Reg3B

    Reg3B was a bit hacky, but had similar hit rates but uses less
    encoding space than using a 4-bit R8..R23 (saving 1 bit on the
    relevant scenarios).




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 21 16:21:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Consider "virgin" page, that is neither accessed nor modified.
    Intruction 1 reads the page, instruction 2 modifies it. After
    both are done you should have both bits set. But if miss handling
    for instruction 1 reads page table entry first, but stores after
    store fomr instruction 2 handler, then you get only accessed bit
    and modified flag is lost. Symbolically we could have

    read PTE for instruction 1
    read PTE for instruction 2
    store PTE for instruction 2 (setting Accessed and Modified)
    store PTE for instruction 1 (setting Accessed but clearing Modified)
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Aug 21 19:26:47 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D caches, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Aug 21 21:48:03 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

    I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    This sounds like an idea worth stealing, although no doubt the way I
    would attempt to copy it would be a failure which removed all the
    usefulness of it.

    For one thing, I don't have a stack for calling subroutines, or any other purpose.

    But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

    The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
    of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
    that.)

    In reverse order:
    If the TI 9900 had used its registers like a write-back cache, then typical access would be fast and efficient. When the register pointer is altered, the old file is written en masse and a new file is read in en masse {possibly with some buffering to lessen the visible cycle count} ... but I digress.

    {Conceptually}
    My 66000 uses this concept for its GPRs and for its Thread State but only at context switch time, not for subroutine calls and returns. HW saves and restores Thread State and Registers on context switches so that the CPU
    never has to Disable Interrupts (it can, it just doesn't have to). {/Conceptually}
    I bracketed the above with 'Conceptually' because it is completely
    reasonable to envision a Core big enough to have 4 copies of Thread
    State and Register files, and bank switch between them. The important property is that the switch delivers control reentrantly; HOW any
    given implementation does that is NOT architecture--that it does IS architecture.

    I specifically left how many registers are preserved to SW per CALL
    because up to 50% need 0, and few % require more than 4. This appears
    to indicate that SPARC using 8 was overkill ... but I digress again.

    Safe Stack is a stack used for preserving the ABI contract between caller
    and callee even in the face of buffer overruns, RoP, and other malicious program behavior. SS places the return address and the preserved registers
    in an area of memory where LD and ST instructions have no access (RWE = 000) but ENTER, EXIT, and RET do. This was done in such a way that correct code
    runs both with SS=on and SS=off, so the compiler does not have to know.

    Only CALL, CALX, RET, ENTER, and EXIT are aware of the existence of SS
    and only in HW implementations.

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, and dealing with TRY-THROW-CATCH SW-defined exception handling. I KNOW of nobody who does this without some kind of stack.

    I happen to use 2 such stacks mostly to harden the environment at low
    cost to malicious attack vectors. It comes with benefits: Lines removed
    from SS do not migrate to L2 or even DRAM, they can be discarded at
    end-of-use, reducing memory traffic; the SW contract between Caller and
    Callee is guaranteed even in the face of malicious code; it can be used
    as a debug tool to catch malicious code. ...

    NOTE: malicious code can still damage data*, just not the preserved regs
    or the return address, guaranteeing that control returns to the instruction
    following CALL. And all without adding a single instruction to the CALL/RET
    instruction sequence.

    (*) memory

    So I've probably completely misunderstood you here.

    Not the first time ...

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 22 16:36:09 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions. Also, to fit
    the design into a single chip designers moved some functionality
    like the bus interface to support chips. A RISC processor with
    mixed 16-32 bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 22 16:45:56 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative numbers but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big but it wasn't *that* big.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 22 17:21:17 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But low numbers given
    for early RISC chips are IMO misleading: RISC become comercialy
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions.

    Like the multiply instruction in ARM2.

    Also, to fit
    design into a single chip designers moved some functionality
    like bus interface to support chips. RISC processor with
    mixed 16-32 bit instructions (needed to get resonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller is much more than
    100 thousend transitors cited for early workstation chips.

    Yep, FP support can be expensive and was an extra option
    on the VAX, which also included integer multiply.

    However, I maintain that a ~1977 supermini with a similar sort
    of bus, MMU, floating point unit etc like the VAX, but with an
    architecture similar to ARM2, plus separate icache and dcache, would
    have beaten the VAX hands-down in performance - it would have taken
    fewer chips to implement, less power and possibly time to develop.
    HP showed this was possible some time later.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 23 08:51:34 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, dealing with TRY- THROW-CATCH SW defined exception handling. I KNOW of nobody who does this without some kind of stack.

    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access to local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
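
    A hedged illustration of the mechanism (GNU C nested functions, a gcc
    extension): taking the address of a nested function that refers to an
    outer local makes gcc materialize a trampoline on the stack, which is
    why the stack ends up executable:

    static int apply(int (*f)(int), int x) { return f(x); }

    int outer(int k)
    {
      int add_k(int v) { return v + k; }  /* captures k from outer's frame */
      return apply(add_k, 32);            /* &add_k points at a stack trampoline */
    }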

    I think we discussed this for My 66000 some time ago, but there
    is no resolution as yet.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Aug 23 16:38:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    It's also seems rather high for the /91. I can't find any authoritative numbers but 100K seems more likely. It was SLT, individual transistors mounted a few to a package. The /91 was big but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: a package
    probably of similar dimensions as a VAX package can hold about 100 TTL
    chips. I do not have detailed data about chip usage and transistor
    counts for each chip. A simple NAND gate is 4 transistors, but the input
    transistor has two emitters and really works like two transistors,
    so it is probably better to count it as 2 transistors, and consequently
    consider a 2-input NAND gate as having 5 transistors. So a 74S00 gives
    20 transistors. A D-flop is probably about 20-30 transistors, so
    a 74S74 is probably around 40-60. A quad D-flop brings us close to 100.
    I suspect that in VAX times octal D-flops were available. There
    were 4-bit ALU slices. Also multiplexers need a nontrivial number
    of transistors. So I think that 50 transistors is a reasonable (maybe
    low) estimate of average density. Assuming 50 transistors per chip,
    that would be 5000 transistors per package. Packages were rather
    flat, so when mounted vertically one probably could allocate 1 cm
    of horizontal space for each. That would allow 30 packages at a
    single level. With 7 levels we get 210 packages, and 210 * 5000 is
    enough for 1 mln transistors.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 23 19:36:47 2025
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    John Levine <johnl@taugh.com> wrote:
    It's also seems rather high for the /91. I can't find any authoritative
    numbers but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: package probably of similar dimensions as VAX package can hold about 100 TTL
    chips. I do not have detailed data about chip usage and transistor
    couns for each chip. Simple NAND gate is 4 transitors, but input
    transitor has two emiters and really works like two transistors
    so it is probably better to count it as 2 transitors, and conseqently consisder 2 input NAND gate as having 5 transitors. So 74S00 gives
    20 transistors. D-flop probably is about 20-30 transitors, so
    74S74 is probably around 40-60. Quad D-flop bring us close to 100.
    I suspect that in VAX time octal D-flops were available. There
    were 4 bit ALU slices. Also multiplexers need nontrivial number
    of transistors. So I think that 50 transistors is reasonable (maybe
    low) estimate of average density. Assuming 50 transitors per chip
    that would be 5000 transistors per package. Packages were rather
    flat, so when mounted vertically one probably could allocate 1 cm
    of horizotal space for each. That would allow 30 packages at
    single level. With 7 levels we get 210 packages, enough for
    1 mln transistors.

    I don't see what this could have to do with the 360/91. As I said,
    it was built with SLT, a few individual transistors and resistors
    per package.

    IBM's first integrated circuits were MST used in the 360/85 and System/3.
    Pugh et al are kind of vague about how many transistors per chip but they
    show an exemplar circuit with six transistors and say there were up to four
    such circuits per chip, so that's still only about two dozen transistors per chip.

    I realize that densities got a lot greater, but that was S/370 and beyond.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Aug 25 00:56:26 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
      long i, r;
      for (i=0, r=0; i<n; i++)
        r+=v[i];
      return r;
    }
    arrays:
    MOV Ri,#0
    MOV Rr,#0
    VEC Rt,{}
    LDD Rl,[Rv,Ri<<3]
    ADD Rr,Rr,Rl
    LOOP LT,Ri,Rn,#1
    MOV R1,Rr
    RET

    7 instructions, 1 instruction-modifier; 8 words.

    long a, b, c, d;

    void globals(void)
    {
      a = 0x1234567890abcdefL;
      b = 0xcdef1234567890abL;
      c = 0x567890abcdef1234L;
      d = 0x5678901234abcdefL;
    }

    globals:
    STD 0x1234567890abcdef,[IP,a]
    STD 0xcdef1234567890ab,[IP,b]
    STD 0x567890abcdef1234,[IP,c]
    STD 0x5678901234abcdef,[IP,d]
    RET

    5 instructions, 13 words, 0 .data, 0 .bss

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays globals Architecture
    28 66 (34+32) RV64GC
    27 69 AMD64
    44 84 ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang) versions.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Aug 26 21:46:24 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 7/28/2025 6:18 PM, John Savard wrote:
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.


    The use of addressing modes drops off pretty sharply though.

    Like, if one could stat it out, one might see a static-use pattern
    something like:
    80%: [Rb+disp]
    15%: [Rb+Ri*Sc]
    3%: (Rb)+ / -(Rb)
    1%: [Rb+Ri*Sc+Disp]
    <1%: Everything else

    Since RISC-V only has [Rb+disp12], the other 20% take at least 2 instructions. Simple math indicates this requires 1.2+ instructions/mem-ref instead of 1.0 instructions/mem-ref. disp12 does not help either.
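
    For illustration (not from the post), an indexed access v[i] with 8-byte
    elements on RV64, with the base in a0 and the index in a1, needs something
    like:

        slli  t0, a1, 3       # t0 = i*8
        add   t0, a0, t0      # t0 = &v[i]
        ld    t1, 0(t0)       # t1 = v[i]

    versus a single LDD R9,[R3,Ri<<3] on a machine with scaled-index addressing.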

    My 66000 does not have (Rb)+ or -(Rb), and most RISC-machines don't either.
    On the other hand, I see more [Rb+Ri<<s+disp] than 1%--more like 3%-4%--
    this is partially due to using indexing rather than incrementation when doing
    loops:

    MOV Ri,#0
    VEC R15,{}
    LDD R9,[R3,Ri<<3+disp]
    calk
    LOOP LT,Ri,#1,Rn
    instead of:
    MOV Ri,#0
    LDA R14,[R3+disp]
    VEC R15,{}
    LDD R9,(R14)+
    calk
    LOOP LT,Ri,#1,Rn
    {and the second loop has an additional ADD in it}

    Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.

    Granted, the dominance of [Rb+Disp] does drop off slightly when
    considering dynamic instruction use. Part of it is due to the
    prolog/epilog sequences.

    I have a lot of [IP,DISP] due to the way the compiler places data.

    If one had instead used (SP)+ and -(SP) addressing for prologs and
    epilogs, then one might see around 20% or so going to these instead.
    Or, if one had PUSH/POP, to PUSH/POP.

    ENTER and EXIT compress prologues and epilogues to a single instruction
    each. They also have the option of placing the preserved registers in
    a place where the called subroutine cannot damage them.

    The discrepancy between static and dynamic instruction counts is then mostly due to things like loops and similar.

    Estimating the effect of loops in a compiler is hard, but I had noted that assuming a scale factor of around 1.5^D for loop nesting depth (D)
    seemed to be in the right area. Many loops end up unreached in many
    iterations, or only running a few times, so possibly counter-intuitively
    it is often faster to assume that a loop body will likely only cycle 2
    or 3 times rather than 100s or 1000s, and trying to aggressively
    optimize loops by assuming large N tends to be detrimental to performance.

    VAX compilers set the loop-count = 10 and did OK for their era. A
    low count (like 10) ameliorates the small loops (letters in a name)
    against the larger loops like Matrix300.

    Well, and at least thus far, profiler-driven optimization isn't really a thing in my case.


    -----------------------

    One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
    ISA has a lot of registers, the relative benefit of LoadOp is reduced.

    LoadOp being mostly a benefit if the value is loaded exactly once, and
    there is some other ALU operation or similar that can be fused with it.

    Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
    z=arr[i]+x;


    But, the relative incidence of things like this is low enough as to not
    save that much.

    The other thing is that one has to implement it in a way that does not increase pipeline length,

    This is the key point about LD-OPs: if you build a pipeline to support
    them, then you will suffer when the instruction stream is independent RISC-
    like instructions--conversely, if you build the pipeline for RISC-like instructions, LD-OPs take a penalty unless you buy off on Medium OoO, at
    least.

    since if one makes the pipeline longer for the
    sake of LoadOp or OpStore, then this is likely to be a net negative for performance vs prioritizing Load/Store, unless the pipeline had already needed to be lengthened for other reasons.

    And thus, this is why RISC-machines largely avoid LD-OPs.

    One can be like, "But what if the local variables are not in registers?"
    but on a machine with 32 or 64 registers, most likely your local
    variable is already going to be in a register.

    So, the main potential merit of LoadOp being "doesn't hurt as bad on a register-starved machine".

    So does poking your eye with a hot knife.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.


    Yeah.

    There are some living descendants of that family, but pretty much
    everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 00:35:18 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    Much more than just "aware": there were at least 15 grad students
    at CMU working on optimizing compilers AND the VAX ISA; as well as Wulf, Newell, and Bell leading the pack.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    Hold on a minute:: My Transcendentals are done in POLY-like fashion,
    it is just that the constants come from ROM inside the FPU, instead
    of user defined DRAM coefficients. Thus, POLY is good, POLY as an
    instruction is bad.

    Related to the microcode issue they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot easier to pipeline.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Compilers have taught us that one-address-mode per instruction is
    "sufficient" {if you are going to have address modes.}

    My work on My 66000 has taught me that 1 constant per instruction
    is nearly sufficient. The only places I break this are ST #val[disp]
    and LOOP cnd,Ri,#inc,#max.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are ways to perform LD and OP as if it were
    LD+OP.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer a ARM/SPARC/HPPA-like handling of conditions.

    Condition codes get hard when DECODE width grows greater than 3.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 00:56:58 2025
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die-area inside the pins.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but one runs out of area when
    you put microcode on the same die (area). Thus, RISC was
    born. Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD} all pipelined.

    Also, PDP-11 compatibility depended on microcode.

    Different address modes mainly.

    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    To summarize, it is not clear to me if RISC in VAX technology
    could be significantly faster than the VAX, especially given the constraint
    of PDP-11 compatibility.

    RISC in MSI TTL logic would not have worked all that well.

    OTOH VAX designers probably felt
    that CISC nature added significant value: they understood
    that cost of programming was significant and believed that
    an orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler.

    Some of us RISC designers believe similarly {about orthogonal
    ISA not about address modes.}

    They
    probably thought that providing reasonably common procedures
    as microcoded instructions made work of programmers simpler
    even if routines were only marginally faster than ordinary
    code.

    We think similarly--but we do not accept µCode being slower
    than SW ISA, or especially compiled HLL.

    Part of this thinking was probably like "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems, they wanted something with unique
    features that customers would want to use.

    s/used/get locked in on/

    Without
    insight into the future it is hard to say that they were
    wrong.

    It is hard to argue that they made ANY mistakes with
    what we know about the world of computers circa 1977.

    It is not hard in 2025.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 01:01:21 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That is actually what we did on Mc88100, and while a lot better than
    just [Base+Disp] it is still not as good as [RB+Ri<<s+Disp]; the latter
    saves the instructions that merely create constants.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 27 05:12:57 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 27 10:56:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls. The microsequencer has some pipelining, overlap read of the next uWord
    with execute of current, which would introduce a branch delay slot into
    the microcode. As it uses the opcode and operand bytes to do N-way jump/call
    to uSubroutines, each of those dispatches might have a branch delay slot too.

    (Similar issues appear in the MV-8000 uSequencer except it appears to
    have 2 or maybe 3 microcode branch delay slots).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 27 17:19:06 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
    ...
    [...] POLY as an
    instruction is bad.

    Exactly.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    It's better to forget this misinformation, and instead remember that
    the VAX has an average CPI of 10.6 (Table 8 of <https://american.cs.ucdavis.edu/academic/readings/papers/p301-emer.pdf>)

    Table 9 of that reference is also interesting:

    CALL/RET instructions take an average 45 cycles, Character
    instructions (I guess this means stuff like EDIT) take an average 117
    cycles, and Decimal instructions take an average 101 cycles. It seems
    that these instructions all have no special hardware support on the
    VAX 11/780 and do it all through microcode. So replacing Character
    and Decimal instructions with calls to functions on a RISC-VAX could
    easily outperform the VAX 11/780 even without special hardware
    support. Now add decimal support like the HPPA has done or string
    support like the Alpha has done, and you see even better speed for
    these instructions.

    For CALL/RET, one might use one of the modern calling conventions.
    However, this loses some capabilities compared to the VAX. So one may
    prefer to keep frame pointers by default and maybe other features that
    allow, e.g., universal cross-language debugging on the VAX without monstrosities like ELF and DWARF.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are way to perform LD and OP as if it were
    LD+OP.

    I don't know what you are getting at here. When implementing the 486,
    Intel chose the following pipeline:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2/OP
    Writeback

    This meant that load-and-op instructions take 2 cycles (and RMW
    instructions take three); it gave us the address-generation interlock (op-to-load latency 2), and 3-cycle taken branches. An alternative
    would have been:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2
    OP
    Writeback

    This would have resulted in a max throughput of 1 CPI for sequences of load-and-op instructions, but would have resulted in an AGI of 3
    cycles, and 4-cycle taken branches.

    For the Bonnell Intel chose such a pipeline (IIRC with a third mem
    stage), but the Bonnell has a branch predictor, so the longer branch
    latency usually does not strike.

    AFAIK IBM used such a pipeline for some S/360 descendants.

    Condition codes get hard when DECODE width grows greater than 3.

    And yet the widest implementations (up to 10 wide up to now) are of
    ISAs that have condition-code registers. Even particularly nasty ones
    in the case of AMD64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 28 07:49:31 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die-area inside the pins.

    Note that most of this is microcode ROM. They complicated
    the logic to get a smaller ROM size. For the VAX it was quite different:
    microcode memory (and cache) were built from LSI chips,
    a technology not suitable for logic at that time. Assuming 6 transistor
    static RAM cells VAX had 590000 transistors in microcode memory
    chips (and another 590000 transistors in cache chips).
    Comparatively one can estimate VAX logic chips as between 20000
    and 100000 transistors, with low numbers looking more likely
    to me. IIUC at least the early VAX-on-a-"single"-chip designs were slowed
    down by going to off-chip microcode memory.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but one runs out of area when
    you put microcode on the same die (area). Thus, RISC was
    born. Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD} all pipelined.

    Yes, but IIUC the big item was on-chip microcode memory (or the pins
    needed to go to external microcode memory).
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Aug 28 15:10:55 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first, and there are instructions with six operands. Or how about
    this:

    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
    It must have seemed clever at the time, but ugh.
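    To make the "one at a time" point concrete, here is a rough C sketch of walking
    VAX operand specifiers; this is an illustration, not DEC's decoder. The
    mode/register split and the "immediate = autoincrement off PC" convention are
    per the VAX architecture; operand_size stands in for the datatype size implied
    by the opcode.

    #include <stddef.h>
    #include <stdint.h>

    /* Returns how many bytes one operand specifier occupies; you cannot know
       where the next specifier starts until this has been computed. */
    static size_t specifier_length(const uint8_t *p, size_t operand_size)
    {
        uint8_t mode = p[0] >> 4;     /* high nibble: addressing mode */
        uint8_t reg  = p[0] & 0x0F;   /* low nibble: register number  */

        switch (mode) {
        case 0: case 1: case 2: case 3:   /* short literal                     */
        case 5:                           /* register                          */
        case 6:                           /* register deferred                 */
        case 7:                           /* autodecrement                     */
            return 1;
        case 8:                           /* autoincrement; (PC)+ is immediate */
            return (reg == 15) ? 1 + operand_size : 1;
        case 9:                           /* autoincr. deferred; (PC)+ is absolute */
            return (reg == 15) ? 1 + 4 : 1;
        case 10: case 11:                 /* byte displacement (deferred)      */
            return 2;
        case 12: case 13:                 /* word displacement (deferred)      */
            return 3;
        case 14: case 15:                 /* long displacement (deferred)      */
            return 5;
        case 4:                           /* index prefix: a base specifier follows */
            return 1 + specifier_length(p + 1, operand_size);
        default:
            return 1;                     /* unreachable: mode is a 4-bit field */
        }
    }

    The serial dependence is visible in the code: the length of the current
    specifier must be known before the decoder can even find the next one.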


    What we must all realize is that each address mode in VAX was a microinstruction all unto itself.

    And that is why it was not pipelineable in any real sense.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Aug 28 13:39:54 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, near as I can tell it took *at least* 1 clock per instruction byte for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline
    stalls.
    The microsequencer has some pipelining, overlap read of the next uWord
    with execute of current, which would introduce a branch delay slot into
    the microcode. As it uses the opcode and operand bytes to do N-way
    jump/call
    to uSubroutines, each of those dispatches might have a branch delay slot too.

    (Similar issues appear in the MV-8000 uSequencer except it appears to
    have 2 or maybe 3 microcode branch delay slots).

    I found a description of the 780 instruction buffer parser
    in the Data Path description on bitsavers and
    it does in fact pull one operand specifier from IB per clock.
    There is a mux network to handle various immediate formats in parallel.

    There are conflicting descriptions as to exactly how it handles the
    first operand, whether that is pulled with the opcode or in a separate clock, as the IB shifter can only do 1 to 5 byte shifts but an opcode with
    a first operand with 32-bit displacement would be 6 bytes.

    But basically it takes 1 clock for the opcode byte and the first operand specifier byte, a second clock if the first opspec has an immediate,
    then 1 clock for each subsequent operand specifier.
    If an operand has an immediate it is extracted in parallel with its opspec.

    If that is correct a MOV rs,rd or ADD rs,rd would take 2 clocks to decode,
    and a MOV offset(rs),rd would take 3 clocks to decode.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Aug 29 10:34:31 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    John Levine <johnl@taugh.com> posted:

    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3
    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first, and there are instructions with six operands. Or how about
    this:

    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
    It must have seemed clever at the time, but ugh.


    What we must all realize is that each address mode in VAX was a microinstruction all unto itself.

    And that is why it was not pipelineable in any real sense.

    Yes. The instructions are designed to be parsed by a byte-code interpreter
    in microcode. Even on the NVAX in 1992, Decode can only produce one
    operand per clock.

    If that operand is one of the complex memory address modes then it
    might be possible to dispatch it and let the back end chew on it
    while Decode works on the second operand.

    But that assumes the operands are in slow memory. If they are in fast
    registers then it stalls waiting for the second and third operands to be decoded, making a pipeline pointless.

    And since programs mostly put operands in registers it stalls at Decode.

    One might say we should just build a fast decoder. But if you look at
    the instruction formats, even the simplest 2 register instructions are
    3 bytes and would require looking at 24 instruction bits and 3 valid bits
    or 27 bits at once. The 3-operand rs1,rs2,rd instructions are 36 bits.

    That decoder has to deal with 2^27 or 2^36 possibilities!
    And that just handles 2 and 3 register instructions, no memory references.

    It is hypothetically possible with a pre-decode stage to compact those
    down to 17 bits for 2 register and 21 bits for 3 register but that is
    still too many possibilities. That just throws transistors at a problem
    that never needed to exist in the first place, and would still not be affordable in 1992 NMOS, certainly not in 1975 TTL.

    If we look at what the VAX is actually spending most of its time on,
    2 and 3 register ALU operations, those can be decoded in parallel by
    looking at 10 bits (8 opcode + 2 valid) for 2 register,
    15 bits (12 opcode + 3 valid) for 3 register instructions.
    Which is quite doable in 1975 TTL in 1 clock.
    And that allows the pipeline to not stall at Decode.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Aug 29 17:07:03 2025
    From Newsgroup: comp.arch

    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access to local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one or the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.
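    GNU C is the odd one out here. For concreteness, a minimal (hypothetical)
    example using gcc's nested-function extension; taking the nested function's
    address and handing it out as a plain function pointer is exactly what forces
    gcc to materialize a trampoline:

    #include <stdio.h>

    static int apply(int (*f)(int), int v) { return f(v); }

    int main(void)
    {
        int x = 42;
        int add_x(int n) { return n + x; }   /* nested function capturing x */
        printf("%d\n", apply(add_x, 8));     /* prints 50 */
        return 0;
    }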

    Other than GNU C (with its support for nested functions), which other
    language has this weird combination of features?


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 30 01:47:03 2025
    From Newsgroup: comp.arch

    On 8/29/2025 4:07 PM, Stefan Monnier wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access to local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one or the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?


    FWIW, BGBCC has this (as both a C extension, and within my rarely-used
    BS2 language).

    But, yeah, in this case, the general idea is that lambdas consist of 2
    or 3 parts:
    The function body, located in ".text";
    The data area holding the captured scope;
    An executable "thunk", which loads the data pointer and transfers
    control to the body (may be either RWX memory, or from a "pool" of
    possible function pointers).


    When implemented as RWX heap, the data area directly follows the thunk,
    and both are located in special executable heap memory. An
    automatic-only capture-by-reference form exists, but still uses heap
    memory for this (but these heap allocations will be freed automatically).

    So, the lambdas look the same as normal C function pointers in this way,
    but creating new lambdas may leak memory if they are not freed.


    There is another option which I have used sometimes which doesn't
    require RWX memory, but which may technically abuse the C ABI:
    Create a pool of functions with a more generic argument list, and then allocate lambdas from the pool. Each function in the pool pulls its
    data-area pointer from an array, with each function in the pool having a corresponding array index (with a set upper limit to the maximum number
    of lambdas).

    Though, arguably, if the number of "live lambdas" is large, or the
    lambdas are never freed, arguably there is a problem with the program
    (and even if an implementation has a hard limit, say, of 256 or 1024
    live lambda instances, this usually isn't too much of a problem).
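    As a rough, portable sketch of the "pool" trick (this is not BGBCC's actual
    code; the names, the fixed (long)->long stub signature, and the pool size are
    all made up for illustration):

    #include <stdio.h>
    #include <stddef.h>

    #define MAX_LAMBDAS 4                     /* hard limit on live lambdas */

    typedef long (*lambda_fn)(long);          /* what callers see: a plain fn ptr */
    typedef long (*lambda_body)(void *env, long arg);

    static void       *lambda_env [MAX_LAMBDAS];
    static lambda_body lambda_impl[MAX_LAMBDAS];

    /* One pre-compiled stub per slot; each fetches its own env from the array. */
    #define DEF_STUB(i) \
        static long stub_##i(long a) { return lambda_impl[i](lambda_env[i], a); }
    DEF_STUB(0) DEF_STUB(1) DEF_STUB(2) DEF_STUB(3)
    static lambda_fn const stubs[MAX_LAMBDAS] = { stub_0, stub_1, stub_2, stub_3 };

    static lambda_fn make_lambda(lambda_body body, void *env)
    {
        for (int i = 0; i < MAX_LAMBDAS; i++)
            if (!lambda_impl[i]) {
                lambda_impl[i] = body;
                lambda_env[i]  = env;
                return stubs[i];
            }
        return NULL;                          /* pool exhausted */
    }

    static void free_lambda(lambda_fn f)
    {
        for (int i = 0; i < MAX_LAMBDAS; i++)
            if (stubs[i] == f) { lambda_impl[i] = NULL; lambda_env[i] = NULL; }
    }

    static long add_env(void *env, long a) { return a + *(long *)env; }

    int main(void)
    {
        long x = 100;
        lambda_fn f = make_lambda(add_env, &x);  /* capture x by reference */
        printf("%ld\n", f(23));                  /* prints 123 */
        free_lambda(f);
        return 0;
    }

    No RWX memory is involved, which is the point; the price is the fixed upper
    limit and the single generic signature.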


    This strategy works better for ABIs which pass every argument in
    basically the same way (or can be made to look like such). If these
    functions need to care about argument number or types (*), it becomes a
    much harder problem.

    *: Though, usually limited to a scheme like JVM-style I/L/F/D/A, as this
    is sufficient, but "X*5^(1..n)" is still a much bigger number than X,
    meaning 'n' (the maximum number of arguments) would need to be kept
    small. This does not scale well...


    For contrast, if one knows, for example, that in the ABI every relevant argument is passed the same regardless of type (say, as a fixed 64-bit element), and that any 128 bit arguments are passed as an even numbered
    pair or similar (and we can always pretend as if we are passing the
    maximum number of arguments), things become simpler.

    This latter approach leaves the use of executable memory as mostly optional;
    but, unlike the pool, executable memory has no set limit on the maximum
    number of lambdas. There are tradeoffs either way.


    Can note that on my ABI designs, the RISC-V LP64 ABI, and the Win64 ABI,
    this property mostly holds. On the SysV AMD64 ABI, or RISC-V LP64D ABI,
    it does not. Can note that BGBCC when targeting RV64 currently uses a
    variant of the LP64 ABI.

    For XG3, it may use either the LP64 ABI, or an experimental "XG3 Native"
    ABI which differs slightly:
    X10..X17 are used for arguments 1..8;
    F10..F17 are used for arguments 9..16;
    F4..F7 are reassigned to being callee-save.
    Partly balancing out the register mix.
    X: 15 scratch; 12 callee-save
    F: 16 scratch; 16 callee-save.
    So: 31 scratch, 28 callee-save.
    Vs: 35 scratch, 24 callee-save.
    Struct pass/return:
    1-8 bytes: 1 register/spot;
    9-16 bytes: 2 registers/spots, padded to an even index.
    17+: pass/return via pointer.
    For struct return, an implicit argument is passed;
    Callee copies returned struct to the address passed by caller.


    Though, another partial motivation for this sort of thing is to make it simpler to marshal COM-style interfaces (it lessens the need for the
    lower levels to care about the method signatures for the
    marshaled objects). Though, a higher level mechanism, such as an RPC implementation, would still need to know about the method signatures.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Aug 30 15:36:46 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access to local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one or the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?

    Well, more precisely:
    - function pointer is supposed to take the same space as a single
    machine address
    - function pointer is supposed to be directly invokable, that is
    point to machine code of the function
    - one wants to support nested functions
    - there is no garbage collector, one does not want to introduce extra
    stack and one does not want to leak storage allocated to nested
    functions.

    To explain more:
    - arguably in "safe" C data pointers should consist
    of 3 machine words; such pointers would have room for the extra data needed
    for nested functions.
    - some calling conventions introduce extra indirection, that is
    function pointer point to a data structure containing address
    of machine code and extra data needed by nested functions.
    Function call puts extra data in dedicated machine register and
    then transfers control via address contained in function data
    structure. IIUC IBM AIX uses such an approach (a rough C sketch
    of this follows the list).
    - one could create trampolines in a separate area of memory. In
    such a case there is trouble with deallocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.
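    A rough C rendering of the descriptor approach from the second bullet above
    (names are made up; in a real ABI such as AIX's the environment load is part
    of the compiler's call sequence, not user code):

    #include <stdio.h>

    struct fdesc {
        long (*code)(void *env, long arg);  /* machine-code entry point   */
        void  *env;                         /* captured outer-scope state */
    };

    /* What an indirect call through such a "function pointer" expands into. */
    static long call_fdesc(const struct fdesc *f, long arg)
    {
        return f->code(f->env, arg);        /* load env, then jump to code */
    }

    /* The "nested function": its environment arrives explicitly. */
    static long counter_step(void *env, long inc)
    {
        long *count = env;
        return *count += inc;
    }

    int main(void)
    {
        long count = 0;
        struct fdesc step = { counter_step, &count };
        printf("%ld\n", call_fdesc(&step, 5));   /* prints 5  */
        printf("%ld\n", call_fdesc(&step, 7));   /* prints 12 */
        return 0;
    }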

    Concerning languages, any language which has nested functions and
    wants seamless cooperation with C needs to resolve the problem.
    That affects Pascal, Ada, PL/I. That is basically most classic
    non-C languages. IIUC several "higher level" languages resolve
    the trouble by combination of parallel stack and/or GC. But
    when a language wants to compete with the efficiency of C and does not
    want GC, then trampolines allocated on machine stack may be the
    only choice (on a register-starved machine a parallel stack may be
    too expensive). AFAIK GNU Ada uses (or used) trampolines
    allocated on machine stack.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 30 13:19:46 2025
    From Newsgroup: comp.arch

    On 8/30/2025 10:36 AM, Waldek Hebisch wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access to local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one or the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are
    heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other
    language has this weird combination of features?

    Well, more precisely:
    - function pointer is supposed to take the same space as a single
    machine address
    - function pointer is supposed to be directly invokable, that is
    point to machine code of the function
    - one wants to support nested functions
    - there is no garbage collector, one does not want to introduce extra
    stack and one does not want to leak storage allocated to nested
    functions.


    Yes.

    To explain more:
    - arguably in "safe" C data pointers should consist
    of 3 machine words; such pointers would have room for the extra data needed
    for nested functions.

    No one wants to pay for this...
    Even 2 machine words per pointer is a hard sell.
    Much less the mess created by any programs that use integer->pointer
    casts (main option I can think of is turning these cases into runtime
    calls).

    I have experimentally used bounds checking in the past, although the
    main form I ended up using sort of crudely approximates the bounds (with
    a minifloat style format) and shoves them into the high 16 bits of the
    pointer (with 0x0000 still allowed for untagged C pointers). It was more limited, more intended to deal with common case "out of bounds" bugs
    rather than do anything security related.
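    As a loose sketch of the idea (much cruder than the minifloat encoding
    described above: here the tag just records a power-of-two span, and all the
    names are invented):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Low 48 bits: address.  Bits 48..63: tag = ceil(log2(object size)),
       with a tag of 0 meaning a plain untagged C pointer. */
    static uint64_t tag_pointer(void *base, uint64_t size)
    {
        unsigned e = 1;
        while (((uint64_t)1 << e) < size && e < 47) e++;
        return ((uint64_t)(uintptr_t)base & 0x0000FFFFFFFFFFFFull)
             | ((uint64_t)e << 48);
    }

    /* Check an access of 'len' bytes at offset 'off' from the object base. */
    static bool access_ok(uint64_t tp, uint64_t off, uint64_t len)
    {
        unsigned e = (unsigned)(tp >> 48);
        if (e == 0) return true;                  /* untagged: no check      */
        return off + len <= ((uint64_t)1 << e);   /* within rounded-up span  */
    }

    /* Strip the tag before dereferencing (assumes 48-bit user addresses). */
    static void *untag(uint64_t tp)
    {
        return (void *)(uintptr_t)(tp & 0x0000FFFFFFFFFFFFull);
    }

    int main(void)
    {
        long buf[8];
        uint64_t p = tag_pointer(buf, sizeof buf);            /* span = 64   */
        ((long *)untag(p))[0] = 1;                            /* normal deref */
        printf("%d %d\n", access_ok(p, 0, 8), access_ok(p, 120, 8)); /* 1 0  */
        return 0;
    }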

    - some calling conventions introduce extra indirection, that is
    function pointer point to a data structure containing address
    of machine code and extra data needed by nested functions.
    Function call puts extra data in dedicated machine register and
    then transfers control via address contained in function data
    structure. IIUC IBM AIX uses such approach.

    Yes. Also FDPIC generally falls into that category.

    Function pointer consists of a pointer to a blob of memory holding a
    code pointer and typically the callee's GOT pointer.

    Would be easier to implement lambdas on top of an FDPIC style ABI in
    this way, but FDPIC tends to kinda suck in terms of having more
    expensive function calls.


    - one could create trampolines in a separate area of memory. In
    such a case there is trouble with deallocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.


    One could classify them based on lifetime:
    Automatic, freed automatically when parent function exits;
    Global, may live indefinitely, but needs to be freed.


    General strategy for auto-freeing is to (internally) build a linked list
    of items that need to be auto-freed when the current function terminates
    (in the epilog, it calls a hidden runtime function which frees
    everything in the list, with each item both providing a data pointer and
    the function needed to free said pointer). This mechanism had also been extended to "alloca()", C99 style VLAs, and any large by-value structs
    or arrays too large to reasonably fit into the current local frame.

    The internal allocation calls can be provided with a double-indirect
    pointer to the linked list of items to be freed.
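    A hedged sketch of that mechanism (the names and the exact calling shape are
    invented; in BGBCC the pushes and the epilog call would be compiler-generated
    rather than written by hand):

    #include <stdlib.h>

    typedef struct autofree_item {
        struct autofree_item *next;
        void *ptr;                       /* thing to release   */
        void (*release)(void *);         /* how to release it  */
    } autofree_item;

    /* Inserted wherever an allocation needs to die with the function. */
    static void autofree_push(autofree_item **list, void *ptr,
                              void (*release)(void *))
    {
        autofree_item *it = malloc(sizeof *it);
        it->next    = *list;
        it->ptr     = ptr;
        it->release = release;
        *list       = it;
    }

    /* Inserted in the epilog of the owning function. */
    static void autofree_run(autofree_item **list)
    {
        while (*list) {
            autofree_item *it = *list;
            *list = it->next;
            it->release(it->ptr);
            free(it);
        }
    }

    int main(void)
    {
        autofree_item *cleanup = NULL;           /* hidden per-frame list head */
        char *tmp = malloc(64);
        autofree_push(&cleanup, tmp, free);      /* e.g. a heap-backed "VLA"   */
        /* ... function body uses tmp ... */
        autofree_run(&cleanup);                  /* everything freed on exit   */
        return 0;
    }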


    Concerning languages, any language which has nested functions and
    wants seamless cooperation with C needs to resolve the problem.
    That affects Pascal, Ada, PL/I. That is basically most classic
    non-C languages. IIUC several "higher level" languages resolve
    the trouble by combination of parallel stack and/or GC. But
    when a language wants to compete with the efficiency of C and does not
    want GC, then trampolines allocated on machine stack may be the
    only choice (on a register-starved machine a parallel stack may be
    too expensive). AFAIK GNU Ada uses (or used) trampolines
    allocated on machine stack.


    Yes, true. But, for other reasons we really don't want RWX on the main
    stack.

    Also, GC is generally too expensive and doesn't play well with many
    use-cases (more or less anything that is timing sensitive).


    Automatic reference counting is sometimes an option, but tends to carry
    too much overhead in the general case (but, may make sense for "dynamic"/"variant" types, where one is already paying for the overhead
    of dynamic type-tag checking). But, wouldn't want to use it for general pointers or function pointers in a C like language.

    Refcounting mostly works OK for dynamic types and similar, and has less performance issues than, say, mark/sweep. Usual big downside is that any cyclic structures will tend to be leaked.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat Aug 30 14:22:32 2025
    From Newsgroup: comp.arch

    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words
    (address of the code plus address of the context/environment/GOT), so
    there's no dynamic allocation involved.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 30 13:46:05 2025
    From Newsgroup: comp.arch

    On 8/30/2025 1:22 PM, Stefan Monnier wrote:
    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so
    there's no dynamic allocation involved.


    FDPIC typically always uses the normal pointer width, just with more indirection:
    Load target function pointer from GOT;
    Save off current GOT pointer to stack;
    Load code pointer from function pointer;
    Load GOT pointer from function pointer;
    Call function;
    Reload previous GOT pointer.
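    In C terms the indirection amounts to roughly this (a hedged sketch; in a
    real FDPIC ABI the GOT pointer lives in a reserved register, the whole dance
    is emitted by the compiler, and the descriptor itself is what was loaded from
    the caller's GOT):

    struct fdpic_funcdesc {
        void (*entry)(void);    /* code address of the callee           */
        void  *got;             /* callee's GOT / data-segment base     */
    };

    static void *current_got;   /* stands in for the reserved GOT register */

    static void fdpic_call(const struct fdpic_funcdesc *fd)
    {
        void *saved_got = current_got;   /* save off current GOT pointer  */
        current_got = fd->got;           /* load callee's GOT pointer     */
        fd->entry();                     /* call through the code pointer */
        current_got = saved_got;         /* reload previous GOT pointer   */
    }

    static void callee(void) { /* PIC code would address data via current_got */ }

    int main(void)
    {
        struct fdpic_funcdesc desc = { callee, 0 };  /* normally fetched from the GOT */
        fdpic_call(&desc);
        return 0;
    }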

    It, errm, kinda sucks...

    Seemingly, thus far there is no FDPIC for RV64 targets (in GCC), but it does
    apparently exist for RV32 No-MMU style targets.


    I took a lower overhead approach in my PBO ABI (optional callee side
    GBR/GP reload), but it lacks any implicit way to implement lambdas.



    Stefan

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 31 05:36:53 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.
    ...
    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?

    Forth has single-cell (cell=machine word) execution tokens (~function pointers). In contrast to C, where in theory you could have bigger
    function pointers and only ABIs and legacy code encourage function
    pointers that fit in a single machine word, in a Forth you would have
    to make all cells bigger (i.e., all integers and all addresses) if you
    want more space for xts.

    Gforth adds closures, that of course have to be represented by
    single-cell execution tokens that behave like other execution tokens.
    But Gforth has the advantage that xts are not represented by addresses
    of machine code, and instead there is one indirection level between
    the xt and the machine code. The reason for that is that xts not only represent colon definitions (~C functions), but also variables,
    constants, and words defined with create...does> (somewhat
    closure-like, but "statically" allocated). So Gforth implements
    closures using an extension of that mechanism, see Section 4 of [ertl&paysan18].

    However, there are Forth systems that implement xts as addresses of
    machine code, and if they implemented closures, they would need to use
    run-time code generation.

    - anton

    @InProceedings{ertl&paysan18,
    author = {M. Anton Ertl and Bernd Paysan},
    title = {Closures --- the {Forth} way},
    crossref = {euroforth18},
    pages = {17--30},
    url = {https://www.complang.tuwien.ac.at/papers/ertl%26paysan.pdf},
    url2 = {http://www.euroforth.org/ef18/papers/ertl.pdf},
    slides-url = {http://www.euroforth.org/ef18/papers/ertl-slides.pdf},
    video = {https://wiki.forth-ev.de/doku.php/events:ef2018:closures},
    OPTnote = {refereed},
    abstract = {In Forth 200x, a quotation cannot access a local
    defined outside it, and therefore cannot be
    parameterized in the definition that produces its
    execution token. We present Forth closures; they
    lift this restriction with minimal implementation
    complexity. They are based on passing parameters on
    the stack when producing the execution token. The
    programmer has to explicitly manage the memory of
    the closure. We show a number of usage examples.
    We also present the current implementation, which
    takes 109~source lines of code (including some extra
    features). The programmer can mechanically convert
    lexical scoping (accessing a local defined outside)
    into code using our closures, by applying assignment
    conversion and flat-closure conversion. The result
    can do everything one expects from closures,
    including passing Knuth's man-or-boy test and living
    beyond the end of their enclosing definitions.}
    }

    @Proceedings{euroforth18,
    title = {34th EuroForth Conference},
    booktitle = {34th EuroForth Conference},
    year = {2018},
    key = {EuroForth'18},
    url = {http://www.euroforth.org/ef18/papers/proceedings.pdf}
    }
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 31 16:21:38 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 8/30/2025 1:22 PM, Stefan Monnier wrote:
    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so there's no dynamic allocation involved.


    FDPIC typically always uses the normal pointer width, just with more indirection:
    Load target function pointer from GOT;
    Save off current GOT pointer to stack;
    Load code pointer from function pointer;
    Load GOT pointer from function pointer;
    Call function;
    Reload previous GOT pointer.

    My 66000 can indirect through GOT so the above sequence is::

    CALX [ip,,GOT[n]-.]

    and references to GOT are like above (functions) or (extern) as::

    LDD Rp,[ip,,GOT[n]-.]

    Each linked module gets its own GOT.

    It, errm, kinda sucks...

    Bad ISA makes many things suck--whereas good ISA does not.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 31 18:04:44 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
    Bytes                             Instructions
    arrays  globals      Architecture    arrays  globals
    28      66 (34+32)   RV64GC          12      9
    27      69           AMD64           11      9
    44      84           ARM A64         11      22
    32      68           My 66000         8      5

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing, sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance of code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Sun Aug 31 16:43:26 2025
    From Newsgroup: comp.arch

    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    Same here (tho I was on team Debian), but I don't think GNU/Linux
    enthusiasts were the main buyers of those Opteron and
    Athlon64 machines.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sun Aug 31 22:26:43 2025
    From Newsgroup: comp.arch

    On Sun, 31 Aug 2025 16:43:26 -0400, Stefan Monnier wrote:

    ... I don't think GNU/Linux enthusiasts were the main buyers of
    those Opteron and Athlon64 machines.

    Their early popularity would have been in servers. And servers were
    already becoming dominated by Linux in those days.

    “Opteron” was specifically a brand name for server chips, as I recall.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 06:07:27 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    Same here (tho I was on team Debian)

    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time,
    and still busily working on their multi-arch (IIRC) plans, so
    eventually I decided to go with Fedora Core 1, which just implemented
    /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after
    (/etc/hostname from 2005-02-20, and IIRC Debian still had not finished hammering out multi-arch at that time), before finally settling in
    Debian-land several years later.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch,alt.folklore.computers on Mon Sep 1 06:57:26 2025
    From Newsgroup: comp.arch

    On Mon, 01 Sep 2025 06:07:27 GMT, Anton Ertl wrote:

    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time
    ... so eventually I decided to go with Fedora Core 1, which just
    implemented /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after ...
    before finally settling in Debian-land several years later.

    Distro-hopping is a long-standing tradition in the Linux world. No other platform comes close.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 07:40:47 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time,
    and still busily working on their multi-arch (IIRC) plans, so
    eventually I decided to go with Fedora Core 1, which just implemented
    /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after
    (/etc/hostname from 2005-02-20, and IIRC Debian still had not finished hammering out multi-arch at that time), before finally settling in Debian-land several years later.

    Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
    first Debian with official AMD64 support.

    Multiarch was introduced in Debian 7 (Wheezy), released 4 May 2013.

    So Multiarch took much longer than they had originally expected, and
    they apparently settled for the lib64 approach for Debian 4-6.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Mon Sep 1 12:15:54 2025
    From Newsgroup: comp.arch

    Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
    first Debian with official AMD64 support.

    Indeed, I misremembered: I used Debian's i386 port on my 2003 AMD64
    machine.
    It didn't have enough RAM to justify the bother of distro hopping. 🙂


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Mon Sep 1 11:33:33 2025
    From Newsgroup: comp.arch

    On 9/1/2025 11:15 AM, Stefan Monnier wrote:
    Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
    first Debian with official AMD64 support.

    Indeed, I misremembered: I used Debian's i386 port on my 2003 AMD64
    machine.
    It didn't have enough RAM to justify the bother of distro hopping. 🙂


    My first AMD64 machine also ended up mostly running 32-bit OS's, but
    more because initially:
    It was unstable if running 64-bit Linux;
    It was also not very stable initially with XP-X64;
    Driver support for XP-X64, initially, was almost nonexistent.
    So, I ended up mostly running 32-bit WinXP on the thing.

    Though, after the initial weak results, on my next machine I had a
    better experience and IIRC had it set up to dual boot XP-X64 and Fedora,
    by that point stuff was stable and XP-X64 had drivers for stuff. I stuck
    with XP-X64 mostly as Vista was kinda trash (until some years later
    jumping to Win7, and now Win10).


    Well, and (at least in those years) Linux still had serious issues with
    driver compatibility, so you could use the OS but typically with no 3D acceleration or sound (and Mesa-GL in SW mode was horribly slow).

    At least Ethernet tended to work as most MOBOs had settled on the
    RTL8139 or similar (well, until MOBOs started having Gigabit Ethernet,
    and suddenly Ethernet no longer worked in Linux for a while, ...).

    Well, Linux land often failed to provide a great experience, not so
    much because of the UI (and, I actually like using the Bash shell for
    stuff), but because of ever-present hardware support issues (so, it couldn't
    usually end up running as the main OS, as much of the HW often didn't
    work).

    ...



    Stefan

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 20:34:13 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It didn't have enough RAM to justify

    My Athlon 64 only had 1GB of RAM, so an IA-32 distribution would have
    done nicely for it, but I wanted to be able to build and run AMD64
    software.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 4 15:23:26 2025
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
    Bytes                             Instructions
    arrays  globals      Architecture    arrays  globals
    28      66 (34+32)   RV64GC          12      9
    27      69           AMD64           11      9
    44      84           ARM A64         11      22
    32      68           My 66000         8      5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing, sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance of code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 10:25:49 2025
    From Newsgroup: comp.arch

    On 9/4/2025 8:23 AM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    In general yes, but as you pointed out in another post, if you are
    talking about a GBOoO machine, it isn't the absolute number of
    instructions (because of parallel execution), but the number of cycles
    to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.

    And, of course your "150%" is arbitrary, but I agree that small
    differences in code size are not important, except in some small
    embedded applications.

    And I guess I would add, as a third, much lower priority, power usage.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 4 21:00:36 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 9/4/2025 8:23 AM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    In general yes, but as you pointed out in another post, if you are
    talking about a GBOoO machine, it isn't the absolute number of
    instructions (because of parallel execution), but the number of cycles
    to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.

    I can't seem to find the code examples anywhere in the snipped text.

    For arrays:
    The inner loops are 4 instructions (3 for My 66000) and the loop is 2×
    data dependent on the integer ADDs, so all 4 instructions can be pitched
    at 1-cycle. Let us assume the loop is executed 10×, so 10 loop-latencies
    is 10-cycles plus LD-latency plus ADD latency:: {using LD-latency = 4}

    setup
    | MOV #0 |
    | MOV #0 |
    loop[0] | LD AGEN|rot | Cache | LD align |rot | D ADD |
    | LP ADD | BLT ! |
    loop[1] | LD AGEN|rot | Cache | LD align |rot | D ADD |
    | LP ADD | BLT ! |
    loop[2] | LD AGEN|rot | Cache | LD align |rot | D ADD |
    | LP ADD | BLT × | repair |
    exit
    | MOV |
    | RET |
    | looping | recovery |

    // where rot is time it takes to route AGEN to the SRAM arrays and back,
    // and showing the exit of the loop by mispredicting the last branch back
    // to the top of the loop, 2-cycle repairing state, and returning from
    // subroutine.

    Any µarchitecture that can start 1 LD per cycle, start 2 integer ADDs
    per cycle, and 1 branch per cycle, has enough resources to perform
    arrays as drawn above.

    For globals:

    RV64GC does 4 LDs and 4 STs, each ST being data dependent on 1 LD.
    It is conceivable that a 10-wide machine might do 4 LDs in a cycle,
    and figure out that the 4 values are in the same cache line, so the
    latency of calculation is LD-latency + ST AGEN. Let's say LD-latency
    is 4-cycles, so the calculation latency is 5-cycles. RET can probably
    be performed simultaneous with the first LD AGEN.

    My 66000 does 4 parallel ST # all of which can start on the same cycle,
    as can RET, for a latency of 1-cycle.

    On the other hand:: My 66000 implementation may only be 6-wide and
    the 4 STs take 2-execution-cycles, but the RET still takes place in
    cycle-1.

    At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    And, of course your "150%" is arbitrary,

    yes, of course, completely arbitrary--but this is the typical RISC-CISC instruction count ratio. Now, on the other hand, My 66000 runs closer to
    115% size and 70% RISC-V count {although the examples above are 66% and
    55%}.

    but I agree that small
    differences in code size are not important, except in some small
    embedded applications.

    And I guess I would add, as a third, much lower priority, power usage.

    I would suggest power has become a second order desire (up from third)
    {maybe even a primary desire at some scales}.

    But note: Nothing delivers a fixed bit-pattern as an operand at lower
    power than plucking the bits from the instruction stream; saving a
    good deal of the power consumed by forwarding (the multiple comparators
    and the find youngest logic plus the buffers to drive the result-to-
    operand multiplexers).

    And certainly: plucking the bit-pattern from the instruction stream is
    vastly lower power than LDing the bit-pattern from memory! Close to
    4× lower.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 4 16:54:21 2025
    From Newsgroup: comp.arch

    On 9/4/2025 12:25 PM, Stephen Fuld wrote:
    On 9/4/2025 8:23 AM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
       Crappy with arrays;
       Crappy with code with lots of large immediate values;
       Crappy with code which mostly works using lots of global variables;
         Say, for example, a lot of Apogee / 3D Realms code;
         They sure do like using lots of global variables.
         id Software also likes globals, but not as much.
       ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
       long i, r;
       for (i=0, r=0; i<n; i++)
         r+=v[i];
       return r;
    }

    arrays:
          MOV  R3,#0
          MOV  R4,#0
          VEC  R5,{}
          LDD  R6,[R1,R3<<3]
          ADD  R4,R4,R6
          LOOP LT,R3,#1,R2
          MOV  R1,R4
          RET


    long a, b, c, d;

    void globals(void)
    {
       a = 0x1234567890abcdefL;
       b = 0xcdef1234567890abL;
       c = 0x567890abcdef1234L;
       d = 0x5678901234abcdefL;
    }

    globals:
         STD #0x1234567890abcdef,[ip,a-.]
         STD #0xcdef1234567890ab,[ip,b-.]
         STD #0x567890abcdef1234,[ip,c-.]
         STD #0x5678901234abcdef,[ip,d-.]
         RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC)
    are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the
    implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is
    better
    than smaller code--so long as the code size is less than 150% of the
    smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    In general yes, but as you pointed out in another post, if you are
    talking about a GBOoO machine, it isn't the absolute number of
    instructions (because of parallel execution), but the number of cycles
    to execute a particular routine.  Of course, this is harder to tell at a glance from a code listing.

    And, of course your "150%" is arbitrary, but I agree that small
    differences in code size are not important, except in some small
    embedded applications.


    Yeah.

    The main use case where code size is a big priority is when trying to fit
    code into a small fixed-size ROM. If loading into RAM, and the RAM is non-tiny, then generally exact binary size is much less important, and
    as long as it isn't needlessly huge/bloated, it doesn't matter too much.

    For traditional software, often data/bss, stack, and heap memory, will
    be the dominant factors for overall RAM usage.

    For a lot of command-line tools, there will often be a lot of code for
    relatively little RAM use, but then the priority is less about minimal
    code size (though small code size will often matter more than performance
    for many such tools) than about the overhead of creating and destroying
    process instances.

    ...


    And I guess I would add, as a third, much lower priority, power usage.


    It depends:
    For small embedded devices, power usage often dominates;
    Usually, this is most affected by executing as few instructions as
    possible while also using the least complicated hardware logic to
    perform those instructions.



    For a lot of DSP tasks, power use is a priority, while often doing lots
    of math operations, in which case one often wants FPUs and similar with
    the minimal sufficient precision (so, for example, rocking it with lots
    of Binary16 math, and FPUs which natively operate on Binary16); or a lot
    of 8 and 16 bit integer math.

    While FP8 is interesting, sadly direct FP8 math is often too low of
    precision for many tasks.


    I guess the issue then becomes one of the cheapest-possible Binary16
    capable FPU (both in terms of logic resources and energy use).

    Ironically, one option here being to use log-scaled values (scaled to
    mimic Binary16) and just sort of pass it off as Binary16. If one
    switches entirely to log-scaled math, then it can at least be
    self-consistent. However, if mixed/matched with "real" Binary16,
    typically only the top few digits will match up.

    Where, as noted, it works well at low precision, but scales poorly (and
    even Binary16 is pushing it).

    Though, unclear about ASIC space.
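
    To make "log-scaled values" concrete, here is a minimal C sketch of a
    logarithmic number system. The 8.8 fixed-point log2 format is picked
    purely for illustration (an assumption, not necessarily the format meant
    above): multiplication collapses to an integer add of the logs, while
    addition needs a correction term, which is one reason the approach
    scales poorly to higher precision.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef int16_t lns16;                 /* 8.8 fixed-point log2(x), x > 0 */

    static lns16  lns_enc(double x) { return (lns16)lrint(log2(x) * 256.0); }
    static double lns_dec(lns16 v)  { return exp2((double)v / 256.0); }

    /* Multiplication is just an integer add of the logs. */
    static lns16 lns_mul(lns16 a, lns16 b) { return (lns16)(a + b); }

    /* Addition needs a correction term log2(1 + 2^(b-a)); real LNS
       hardware approximates it with a small table. */
    static lns16 lns_add(lns16 a, lns16 b)
    {
        if (a < b) { lns16 t = a; a = b; b = t; }
        double corr = log2(1.0 + exp2((double)(b - a) / 256.0));
        return (lns16)(a + lrint(corr * 256.0));
    }

    int main(void)
    {
        lns16 x = lns_enc(3.0), y = lns_enc(7.0);
        printf("3*7 ~= %.3f  3+7 ~= %.3f\n",
               lns_dec(lns_mul(x, y)), lns_dec(lns_add(x, y)));
        return 0;                          /* prints roughly 21 and 10 */
    }
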



    For integer math, it might make sense to use a lot of zero-extended
    16-bit math, since using sign-extended math would likely waste more
    energy flipping all the high order bits for sign extension.

    Well, or other possible even more wacky options, like zigzag-folded gray
    coded byte values.

    Though, it would be kinda wacky/nonstandard, if ALU operations fold the
    sign into the LSB and use gray-coding for the value, then arithmetic
    could be performed while minimizing the number of bit flips and thus potentially using less total energy for registers and memory operations.

    Though, potentially, the CPU could be made to look as-if it were
    operating on normal twos complement math; since if the arithmetic
    results are the same, it might be essentially invisible to the software
    that numbers are being stored in a nonstandard way.

    Or, say, mapping from twos complement to folded bytes (with every byte
    being folded):
    00->00, 01->02, 02->06, 03->04, ...
    FF->01, FE->03, FD->07, FC->05, ...
    So, say, a value flipping sign would typically only need to flip a small fraction of the number of bits (and the encode/decode process would
    mostly consist of bitwise XORs).
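
    As a concrete sketch of that per-byte transform (my reconstruction of the
    mapping above, so the exact formula is an assumption): fold the sign into
    bit 0, ones'-complement the magnitude for negatives, and gray-code the
    result. Small magnitudes of either sign then differ from zero in only a
    few low-order bits, which is the property being aimed for.

    #include <stdint.h>
    #include <stdio.h>

    /* Reproduces the table above: 00->00, 01->02, 02->06, 03->04, ...
                                   FF->01, FE->03, FD->07, FC->05, ...  */
    static uint8_t fold_byte(uint8_t b)
    {
        uint8_t sign = b >> 7;                  /* two's-complement sign bit */
        uint8_t mag  = sign ? (uint8_t)~b : b;  /* ones'-complement fold     */
        uint8_t gray = (uint8_t)(mag ^ (mag >> 1));
        return (uint8_t)((gray << 1) | sign);   /* sign lives in the LSB     */
    }

    static uint8_t unfold_byte(uint8_t e)
    {
        uint8_t sign = e & 1;
        uint8_t mag  = e >> 1;
        mag ^= mag >> 1;                        /* inverse gray (prefix XOR) */
        mag ^= mag >> 2;
        mag ^= mag >> 4;
        return sign ? (uint8_t)~mag : mag;
    }

    int main(void)
    {
        for (int v = -4; v <= 3; v++) {
            uint8_t b = (uint8_t)v;
            printf("%02X -> %02X -> %02X\n",
                   b, fold_byte(b), unfold_byte(fold_byte(b)));
        }
        return 0;
    }
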


    Though, might still make sense to keep things "normal" in the ALU and
    CPU registers, but then apply such a transform at the level of the
    memory caches (and in external RAM). A lot may depend on the energy cost
    of performing this transformation though (and, it does implicitly assume
    that much of the RAM is holding signed integer values).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 15:03:15 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    So, the overall sizes (including data size for globals() on RV64GC) are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    Performance from a given chip area.

    The RISC-V people argue that they can combine instructions with a few transistors. But, OTOH, they have 16-bit and 32-bit wide
    instructions, which means that a part of the decoder results will be
    thrown away, increasing the decode cost for a given number of average
    decoded instructions per cycle. Plus, they need more decoded
    instructions per cycle for a given amount of performance.

    Intel and AMD demonstrate that you can get high performance even with
    an instruction set that is even worse for decoding, but that's not cheap.

    ARM A64 goes the other way: Fixed-width instructions ensure that all
    decoding on correctly predicted paths is actually useful.

    However, it pays for this in other ways: Instructions like load pair
    with auto-increment need to write 3 registers, and the write port
    arbitration certainly has a hardware cost. However, such an
    instruction would need two loads and an add if expressed in RISC-V; if
    RISC-V combines these instructions, it has the same write-port
    arbitration problem. If it does not combine at least the loads, it
    will tend to perform worse with the same number of load/store units.

    So it's a balancing game: If you lose some weight here, do you need to
    add the same, more, or less weight elsewhere to compensate for the
    effects elsewhere?

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    I don't think that even VAX encoding would be the major problem of the
    VAX these days. There are microop caches and speculative decoders for
    that (although, as EricP points out, the VAX is an especially
    expensive nut to crack for a speculative decoder).

    In any case, if smaller code size was it, RV64GC would win according
    to my results. However, compilers often generate code that has a
    bigger code size rather than a smaller one (loop unrolling, inlining),
    so code size is not that important in the eyes of the maintainers of
    these compilers.

    I also often see code produced with more (dynamic) instructions than
    necessary. So the number of instructions is apparently not that
    important, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 5 14:26:28 2025
    From Newsgroup: comp.arch

    On 9/5/2025 10:03 AM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    So, the overall sizes (including data size for globals() on RV64GC) are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    Performance from a given chip area.

    The RISC-V people argue that they can combine instructions with a few transistors. But, OTOH, they have 16-bit and 32-bit wide
    instructions, which means that a part of the decoder results will be
    thrown away, increasing the decode cost for a given number of average
    decoded instructions per cycle. Plus, they need more decoded
    instructions per cycle for a given amount of performance.

    Intel and AMD demonstrate that you can get high performance even with
    an instruction set that is even worse for decoding, but that's not cheap.

    ARM A64 goes the other way: Fixed-width instructions ensure that all
    decoding on correctly predicted paths is actually useful.

    However, it pays for this in other ways: Instructions like load pair
    with auto-increment need to write 3 registers, and the write port
    arbitration certainly has a hardware cost. However, such an
    instruction would need two loads and an add if expressed in RISC-V; if
    RISC-V combines these instructions, it has the same write-port
    arbitration problem. If it does not combine at least the loads, it
    will tend to perform worse with the same number of load/store units.

    So it's a balancing game: If you lose some weight here, do you need to
    add the same, more, or less weight elsewhere to compensate for the
    effects elsewhere?


    It is tradeoffs...

    Load/Store Pair helps, and isn't too bad if one already has the register
    ports (if it is at least a 2-wide superscalar, you can afford it with
    little additional cost).

    Auto-increment slightly helps with code density, but is a net negative
    in other ways. Depending on implementation, some of its more obvious
    use-cases (such as behaving like a PUSH/POP) may end up slower than
    using separate instructions.


    Say, the most obvious way to implement auto-increment in my case would
    be to likely have the instruction decode as if there were an implicit
    ADD being executed in parallel.

    Say:
    MOV.Q (R10)+, R18
    MOV.Q R19, -(R11)
    Behaving as:
    ADD 8, R10 | MOV.Q (R10), R18
    ADD -8, R11 | MOV.Q R19, -8(R11)
    So far, so good... Both execute with a 1-cycle latency, but...
    MOV.Q R18, -(R2)
    MOV.Q R19, -(R2)
    MOV.Q R20, -(R2)
    MOV.Q R21, -(R2)
    Would take 8 cycles rather than 4 (due to R2 dependencies).

    Vs:
    MOV.Q R18, -8(R2) //*1
    MOV.Q R19, -16(R2)
    MOV.Q R20, -24(R2)
    MOV.Q R21, -32(R2)
    ADD -32, R2
    Needing 5 cycles (or, maybe 4, if the superscalar logic is clever and
    can run the ADD in parallel with the final MOV.Q).

    *1: Where, "-8(R2)" and "(R2, -8)" are analogous as far as BGBCC's ASM
    parser are concerned, but the former is more traditional, so figured I
    would use it here.


    Likewise, in C if you write:
    v0=*cs++;
    v1=*cs++;
    And it were compiled as auto-increment loads, it could also end up
    slower than a Load+Load+ADD sequence (for the same reason).

    But, what about:
    v0=*cs++;
    //... something else unrelated to cs (or v0).
    Well, then the ADD gets executed in parallel with whatever follows, so
    may still work out to a 1-cycle latency in this case.


    And, a feature is not particularly compelling when its main obvious use
    cases would end up with little/no performance gain (or would actually
    end up slower than what one does in its absence).

    Only really works if one has a 1-cycle ADD.

    Where, otherwise, seemingly the only real advantage of auto-increment
    being to make the binaries slightly smaller.


    Wouldn't take much to re-add it though, as noted, the ancestor of the
    current backend was written for an ISA that did have auto-increment. I
    just sort of ended up dropping it as it wasn't really worth it. Not only
    was it not particularly effective, but tended to be a lot further down
    the ranking in terms of usage frequency of addressing modes. If one
    excludes using it for PUSH/POP, its usage frequency basically falls to
    "hardly anything". Otherwise, you can basically count how many times you
    see "*ptr++" or similar in C, this is about all it would ever end up
    being used; which even in C, is often relatively infrequent).




    But, yeah, as noted, the major areas where RISC-V tends to lose out
    IMHO are:
    Lack of Indexed Load/Store;
    Crappy handling of large constants and lack of large immediate values.

    I had noted before, that the specific combination of adding these features:
    Indexed Load/Store;
    Load/Store Pair;
    Jumbo Prefixes.
    Both improves code density over plain RV64G/RV64GC, and also gains a
    roughly 40-50% speedup in programs like Doom.

    While not significantly increasing logic cost over what one would
    already need for a 2-wide machine. Could make sense to skip them for a
    1-wide machine, but then you don't really care too much about
    performance if going for 1-wide.


    Then again, Indexed Load/Store, due to a "register port issue" for
    Indexed Store, does give a performance advantage to going 3 wide over 2
    wide even if the 3rd lane is rarely used otherwise.


    Though, one could argue:
    But, the relative delta (of assuming these features, over plain RV64GC)
    is slightly less if one assumes a CPU with 1 cycle latency on ALU
    instructions and similar. But, this is still kind of weak IMO (ideally,
    the latency cost of ADD and similar should affect everything equally;
    that 2-cycle ADDs and shifts disproportionately hurt RV64G/RV64GC performance is not to RV64G's merit).

    Well, and Zba helps, but not fully. If SHnADD still has a 2c
    latency, well, your indexed load is still 3 cycles vs 5 cycles, but
    still worse than 1 cycle...

    And, statistically, indexed loads tend to be far too large a share of the
    dynamic instruction mix to justify cheaping out here. Even if static instruction counts make them seem less relevant, indexed loads also tend
    to be more concentrated inside loops (whereas fixed-displacement loads
    are more concentrated in prologs and epilogs). If one excludes the
    prolog and epilog related loads/stores, the proportion of indexed
    load/store goes up significantly.



    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    I don't think that even VAX encoding would be the major problem of the
    VAX these days. There are microop caches and speculative decoders for
    that (although, as EricP points out, the VAX is an especially
    expensive nut to crack for a speculative decoder).


    Well, if Intel and AMD could make x86 work... yeah...


    In any case, if smaller code size was it, RV64GC would win according
    to my results. However, compilers often generate code that has a
    bigger code size rather than a smaller one (loop unrolling, inlining),
    so code size is not that important in the eyes of the maintainers of
    these compilers.


    I haven't really tested, but I suspect one could improve over RV64GC
    slightly here.


    For example:

    * 00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
    * 01in-nnnn-iiii-0000 LI Imm5s, Rn5
    * 10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
    * 11mn-nnnn-mmmm-0000 MV Rm5, Rn5

    * 0000-nnnn-iiii-0100 ADDW Imm4u, Rn4
    * 0001-nnnn-mmmm-0100 SUB Rm4, Rn4
    * 0010-nnnn-mmmm-0100 ADDW Imm4n, Rn4
    * 0011-nnnn-mmmm-0100 MVW Rm4, Rn4 //ADDW Rm, 0, Rn
    * 0100-nnnn-mmmm-0100 ADDW Rm4, Rn4
    * 0101-nnnn-mmmm-0100 AND Rm4, Rn4
    * 0110-nnnn-mmmm-0100 OR Rm4, Rn4
    * 0111-nnnn-mmmm-0100 XOR Rm4, Rn4

    * 0iii-0nnn-0mmm-1001 ? SLL Rm3, Imm3u, Rn3
    * 0iii-0nnn-1mmm-1001 ? SRL Rm3, Imm3u, Rn3
    * 0iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3u, Rn3
    * 0iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3u, Rn3
    * 1iii-0nnn-0mmm-1001 ? AND Rm3, Imm3u, Rn3
    * 1iii-0nnn-1mmm-1001 ? SRA Rm3, Imm3u, Rn3
    * 1iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3n, Rn3
    * 1iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3n, Rn3

    * 0ooo-0nnn-0mmm-1101 ? SLL Rm3, Ro3, Rn3
    * 0ooo-0nnn-1mmm-1101 ? SRL Rm3, Ro3, Rn3
    * 0ooo-1nnn-0mmm-1101 ? AND Rm3, Ro3, Rn3
    * 0ooo-1nnn-1mmm-1101 ? SRA Rm3, Ro3, Rn3
    * 1ooo-0nnn-0mmm-1101 ? ADD Rm3, Ro3, Rn3
    * 1ooo-0nnn-1mmm-1101 ? SUB Rm3, Ro3, Rn3
    * 1ooo-1nnn-0mmm-1101 ? ADDW Rm3, Ro3, Rn3
    * 1ooo-1nnn-1mmm-1101 ? SUBW Rm3, Ro3, Rn3

    * 0ddd-nnnn-mmmm-0001 LW Disp3u(Rm4), Rn4
    * 1ddd-nnnn-mmmm-0001 LD Disp3u(Rm4), Rn4
    * 0ddd-nnnn-mmmm-0101 SW Rn4, Disp3u(Rm4)
    * 1ddd-nnnn-mmmm-0101 SD Rn4, Disp3u(Rm4)

    * 00dn-nnnn-dddd-1001 LW Disp5u(SP), Rn5
    * 01dn-nnnn-dddd-1001 LD Disp5u(SP), Rn5
    * 10dn-nnnn-dddd-1001 SW Rn5, Disp5u(SP)
    * 11dn-nnnn-dddd-1001 SD Rn5, Disp5u(SP)

    * 00dd-dddd-dddd-1101 J Disp10
    * 01dn-nnnn-dddd-1101 LD Disp5u(SP), FRn5
    * 10in-nnnn-iiii-1101 LUI Imm5s, Rn5
    * 11dn-nnnn-dddd-1101 SD FRn5, Disp5u(SP)

    Could achieve a higher average hit-rate than RV-C while *also* using
    less encoding space.


    Why? Partly because Reg4 (R8..R23) is less useless than Reg3 (R8..R15).

    Less shift range, but shifts are over-represented in RV-C, and the
    shifts that are present have a very low hit rate due to tending not to
    match the patterns that tend to exist in the compiler output (unlike
    ADD, shifts being far more likely to have different source and
    destination registers).


    The 3R/3RI instructions would still be limited to the "kinda useless"
    3-bit registers, but this still isn't exactly worse than what is already
    the case for RV-C (even if they still have a poor hit rate).

    I left out things like ADDI16SP and ADDI4SPN and similar, as these
    aren't frequent enough to be relevant here (nor do existing instances of
    "ADD SP, Imm, Rn" tend to hit within the limitations of "ADDI4SPN", as
    it is still borderline useless in BGBCC in this case, *1).


    *1: The only times Reg3 has an OK hit rate is in leaf functions, and
    there seems to be a strong negative correlation between leaf functions
    and stack arrays. Also at best, the underlying instruction tends to have
    a low hit-rate as, when a stack array is used semi-frequently, BGBCC
    tends to end up loading the address into a register and leaving it there
    for multiple uses (and, due to "quirks", if you access a local array in
    an inner loop, it will tend to end up in the fixed-assignment case, in
    which case the array address is loaded into a register one-off in the
    prolog). The ADDI4SPN instruction only really makes sense if one assumes
    that stack arrays are both very frequent (in leaf functions?) and/or
    that the compiler tends to load the address of the array into a scratch register and then immediately discard it again (neither of which seems
    true in my case).

    ADDI16SP would be relevant for prologs and epilogs, but has a
    statistical incidence too low to really justify a 16 bit encoding (in
    many cases, would only occur twice per function or so, which is
    statistically, a fairly low incidence rate).

    ...


    Though, that said, RVC in BGBCC still does seem to be semi-effective
    despite its limitations.



    I also often see code produced with more (dynamic) instructions than necessary. So the number of instructions is apparently not that
    important, either.


    Yeah, probably true.

    Often it seems better to try to minimize instruction-instruction
    dependency chains.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 5 14:38:15 2025
    From Newsgroup: comp.arch

    On 9/5/2025 2:26 PM, BGB wrote:
    On 9/5/2025 10:03 AM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    ...

    And, self-correction...


    For example:

    * 00in-nnnn-iiii-0000  ADD        Imm5s, Rn5  //"ADD 0, R0" = TRAP
    * 01in-nnnn-iiii-0000  LI        Imm5s, Rn5
    * 10mn-nnnn-mmmm-0000  ADD        Rm5, Rn5
    * 11mn-nnnn-mmmm-0000  MV        Rm5, Rn5

    * 0000-nnnn-iiii-0100  ADDW        Imm4u, Rn4
    * 0001-nnnn-mmmm-0100  SUB        Rm4, Rn4
    * 0010-nnnn-mmmm-0100  ADDW        Imm4n, Rn4
    * 0011-nnnn-mmmm-0100  MVW        Rm4, Rn4 //ADDW  Rm, 0, Rn
    * 0100-nnnn-mmmm-0100  ADDW        Rm4, Rn4
    * 0101-nnnn-mmmm-0100  AND        Rm4, Rn4
    * 0110-nnnn-mmmm-0100  OR        Rm4, Rn4
    * 0111-nnnn-mmmm-0100  XOR        Rm4, Rn4

    * 0iii-0nnn-0mmm-1001 ? SLL        Rm3, Imm3u, Rn3
    * 0iii-0nnn-1mmm-1001 ? SRL        Rm3, Imm3u, Rn3
    * 0iii-1nnn-0mmm-1001 ? ADD        Rm3, Imm3u, Rn3
    * 0iii-1nnn-1mmm-1001 ? ADDW        Rm3, Imm3u, Rn3
    * 1iii-0nnn-0mmm-1001 ? AND        Rm3, Imm3u, Rn3
    * 1iii-0nnn-1mmm-1001 ? SRA        Rm3, Imm3u, Rn3
    * 1iii-1nnn-0mmm-1001 ? ADD        Rm3, Imm3n, Rn3
    * 1iii-1nnn-1mmm-1001 ? ADDW        Rm3, Imm3n, Rn3

    * 0ooo-0nnn-0mmm-1101 ? SLL        Rm3, Ro3, Rn3
    * 0ooo-0nnn-1mmm-1101 ? SRL        Rm3, Ro3, Rn3
    * 0ooo-1nnn-0mmm-1101 ? AND        Rm3, Ro3, Rn3
    * 0ooo-1nnn-1mmm-1101 ? SRA        Rm3, Ro3, Rn3
    * 1ooo-0nnn-0mmm-1101 ? ADD        Rm3, Ro3, Rn3
    * 1ooo-0nnn-1mmm-1101 ? SUB        Rm3, Ro3, Rn3
    * 1ooo-1nnn-0mmm-1101 ? ADDW        Rm3, Ro3, Rn3
    * 1ooo-1nnn-1mmm-1101 ? SUBW        Rm3, Ro3, Rn3


    ^ flip the LSB for all of the 3R instructions there, it seemed to be a
    screw up when coming up with my listing...

    But, these were the newest and most uncertain addition, as they use a
    big chunk of encoding space and aren't great for hit rate due to Reg3
    and similar.


    * 0ddd-nnnn-mmmm-0001  LW        Disp3u(Rm4), Rn4
    * 1ddd-nnnn-mmmm-0001  LD        Disp3u(Rm4), Rn4
    * 0ddd-nnnn-mmmm-0101  SW        Rn4, Disp3u(Rm4)
    * 1ddd-nnnn-mmmm-0101  SD        Rn4, Disp3u(Rm4)

    * 00dn-nnnn-dddd-1001  LW        Disp5u(SP), Rn5
    * 01dn-nnnn-dddd-1001  LD        Disp5u(SP), Rn5
    * 10dn-nnnn-dddd-1001  SW        Rn5, Disp5u(SP)
    * 11dn-nnnn-dddd-1001  SD        Rn5, Disp5u(SP)

    * 00dd-dddd-dddd-1101  J        Disp10
    * 01dn-nnnn-dddd-1101  LD        Disp5u(SP), FRn5
    * 10in-nnnn-iiii-1101  LUI        Imm5s, Rn5
    * 11dn-nnnn-dddd-1101  SD        FRn5, Disp5u(SP)

    Could achieve a higher average hit-rate than RV-C while *also* using
    less encoding space.


    Granted, more testing could be done.

    This partly came up as another possibility for a "compressed" XG3, which basically just trades the space used for predicated ops back for 16-bit ops.

    But, alas, RV-C doesn't hold up as well if you try to trim it down to
    2/3 the encoding space.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Sep 5 21:56:07 2025
    From Newsgroup: comp.arch

    On 2025-09-04 11:23 a.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    Things could be architect-ed to allow a tradeoff between code size and
    number of instructions executed in the same ISA. Sometimes one may want
    really small code; other times performance is more important.


    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing, sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance of code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four
    instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Sep 10 13:31:58 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    Things could be architect-ed to allow a tradeoff between code size and number of instructions executed in the same ISA. Sometimes one may want really small code; other times performance is more important.

    That's what -Os vs. -O1, -O2, -O3 etc is about :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 12 17:47:59 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    - one could create trampolines in a separate area of memory. In
    such case there is trouble with dealocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    [...]

    gcc has -ftrampoline-impl=[stack|heap], see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

    Don't longjmp out of a nested function though.
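
    For concreteness, a minimal GNU C example (mine, not from the post) of
    the kind of nested function that makes gcc emit a trampoline; with a
    sufficiently recent gcc, -ftrampoline-impl=heap moves that trampoline
    off the (otherwise executable) stack:

    /* gcc -O2 nested.c                          (stack trampoline, default)
       gcc -O2 -ftrampoline-impl=heap nested.c   (heap trampoline)          */
    #include <stdio.h>

    static void apply(void (*f)(int), int n)
    {
        for (int i = 0; i < n; i++)
            f(i);
    }

    int main(void)
    {
        int sum = 0;

        /* Nested function (GNU extension); it captures 'sum' from the
           enclosing frame, so taking its address requires a trampoline. */
        void add(int i) { sum += i; }

        apply(add, 5);
        printf("sum = %d\n", sum);   /* prints 10 */
        return 0;
    }
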
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 12 19:02:01 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Waldek Hebisch <antispam@fricas.org> schrieb:

    - one could create trampolines in a separate area of memory. In
    such case there is trouble with dealocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    [...]

    gcc has -ftrampoline-impl=[stack|heap], see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

    Don't longjmp out of a nested function though.

    Or longjump around subroutines using 'new'.
    Or longjump out of 'signal' handlers.
    ...

    Somebody should write a paper entitled "longjump considered dangerous" annotating all the ways it can be used to abuse compiler assumptions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Sep 14 15:16:33 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Waldek Hebisch <antispam@fricas.org> schrieb:

    - one could create trampolines in a separate area of memory. In
    such case there is trouble with dealocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    [...]

    gcc has -ftrampoline-impl=[stack|heap], see
    https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

    Don't longjmp out of a nested function though.

    Or longjump around subroutines using 'new'.

    Actually, one must take care when calling longjmp
    from a function that allocates memory to ensure that
    the memory will be tracked and/or deallocated as
    and when required. That's perfectly feasible.

    Although in my experience, code that uses longjmp
    (say instead of C++ exceptions) is performance
    sensitive and performance sensitive code doesn't
    do dynamic allocation (i.e. high-performance
    C++ code won't use the standard C++ library
    functionality that requires dynamic allocation).


    Or longjump out of 'signal' handlers.

    Again, one must take the appropriate care. Such as
    using the correct API (e.g. POSIX siglongjmp(2)).

    It is quite common to use siglongjmp to leave
    a SIGINT (Control-C) handler.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/siglongjmp.html
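
    A minimal sketch of that pattern (structure and names are illustrative,
    not taken from any particular program):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static sigjmp_buf toplevel;

    static void on_sigint(int sig)
    {
        (void)sig;
        siglongjmp(toplevel, 1);         /* unwind back to the main loop */
    }

    int main(void)
    {
        if (sigsetjmp(toplevel, 1))      /* 1: save/restore the signal mask */
            printf("\ninterrupted, back at top level\n");

        struct sigaction sa;
        sa.sa_handler = on_sigint;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGINT, &sa, NULL);

        for (;;)
            pause();                     /* stands in for the real work */
    }
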
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Sep 17 05:55:02 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Or longjump out of 'signal' handlers.

    Again, one must take the appropriate care. Such as
    using the correct API (e.g. POSIX siglongjmp(2)).

    The restrictions on siglongjmp() when jumping out of signal handlers
    are the same as for longjmp(). See the section "Application Usage" in https://pubs.opengroup.org/onlinepubs/9799919799/functions/longjmp.html

    The only difference is that sigsetjmp() saves the signal mask and
    siglongjmp() restores it.

    It is quite common to use siglongjmp to leave
    a SIGINT (Control-C) handler.

    In Gforth we certainly use longjmp() to leave the SIGINT handler as
    well as a number of synchronous-signal handlers in Gforth (Gforth does
    nothing with signal masks, so sigsetjmp()/siglongjmp() would make no difference). Of course, there is the possibility that the signal
    handler is invoked while some internal structure of an
    async-signal-unsafe function such as malloc() or fwrite() is in an
    inconsistent state, and then we will see breakage, but I have not seen
    that yet, and nobody has complained.

    The safe approach would be to set a flag in the SIGINT handler, and
    check that flag in safe places, but relatively frequently (e.g., just
    before a loop-back edge in the Forth program).
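
    A sketch of that flag-based approach (names are illustrative; Gforth's
    actual code differs):

    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t interrupted = 0;

    static void on_sigint(int sig)
    {
        (void)sig;
        interrupted = 1;                 /* async-signal-safe: only set a flag */
    }

    int main(void)
    {
        signal(SIGINT, on_sigint);

        for (long i = 0; ; i++) {        /* stands in for the interpreter loop */
            /* ... one unit of work ... */
            if (interrupted) {           /* polled at a safe place, e.g. a
                                            loop back-edge */
                interrupted = 0;
                printf("caught SIGINT after %ld iterations\n", i);
                break;
            }
        }
        return 0;
    }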

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 17 13:58:57 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Or longjump out of 'signal' handlers.

    Again, one must take the appropriate care. Such as
    using the correct API (e.g. POSIX siglongjmp(2)).

    The restrictions on siglongjmp() when jumping out of signal handlers
    are the same as for longjmp(). See the section "Application Usage" in
    https://pubs.opengroup.org/onlinepubs/9799919799/functions/longjmp.html

    The only difference is that sigsetjmp() saves the signal mask and
    siglongjmp() restores it.

    Which is important in any threaded application that uses
    signals.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Marco Moock@mm@dorfdsl.de to comp.arch,alt.folklore.computers on Sun Sep 21 16:20:19 2025
    From Newsgroup: comp.arch

    On 31.08.2025 16:43 Uhr Stefan Monnier wrote:

    Same here (tho I was on team Debian), but I don't think GNU/Linux
    enthusiasts were the main buyers of those Opteron and
    Athlon64 machines.

    Athlon 64 machines were mostly shipped with Windows XP 32 bit - even
    when XP 64 bit existed for that architecture.
    --
    kind regards
    Marco

    Send spam to 1756651406muell@stinkedores.dorfdsl.de

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Nuno Silva@nunojsilva@invalid.invalid to comp.arch,alt.folklore.computers on Sun Sep 21 15:45:52 2025
    From Newsgroup: comp.arch

    On 2025-09-21, Marco Moock wrote:

    On 31.08.2025 16:43 Uhr Stefan Monnier wrote:

    Same here (tho I was on team Debian), but I don't think GNU/Linux
    enthusiasts were the main buyers of those Opteron and
    Athlon64 machines.

    Athlon 64 machines were mostly shipped with Windows XP 32 bit - even
    when XP 64 bit existed for that architecture.

    Which one was NT 5.2 and not 5.1? XP for IA-64 or XP for amd64?
    --
    Nuno Silva
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Sun Sep 21 17:54:46 2025
    From Newsgroup: comp.arch

    On Sun, 21 Sep 2025 15:45:52 +0100
    Nuno Silva <nunojsilva@invalid.invalid> wrote:

    On 2025-09-21, Marco Moock wrote:

    On 31.08.2025 16:43 Uhr Stefan Monnier wrote:

    Same here (tho I was on team Debian), but I don't think GNU/Linux
    enthusiasts were the main buyers of those Opteron and
    Athlon64 machines.

    Athlon 64 machines were mostly shipped with Windows XP 32 bit - even
    when XP 64 bit existed for that architecture.

    Which one was NT 5.2 and not 5.1? XP for IA-64 or XP for amd64?


    XP for AMD64. The same version number as WS2003: https://en.wikipedia.org/wiki/List_of_Microsoft_Windows_versions


    --- Synchronet 3.21a-Linux NewsLink 1.2