• Stealing a Great Idea from the 6600

    From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 17 15:19:03 2024
    From Newsgroup: comp.arch

    Not that I expect Mitch Alsup to approve!

    The 6600 had several I/O processors with a 12-bit word length that
    were really one processor, basically using SMT.

    Well, if I have a processor with an ISA that involves register banks
    of 32 registers each... an alternate instruction set involving
    register banks of 8 registers each would let me allocate either one
    compute thread or four threads with the I/O processor instruction set.

    And what would the I/O processor instruction set look like?

    Think of the PDP-11 or the 9900 but give more importance to
    floating-point. So I've come up with this format for a part of the
    instruction set:

    0 : 1 bit
    (First two bits of opcode: 00, 01, or 10 but not 11): 2 bits
    (remainder of opcode): 5 bits
    (mode, not 11): 2 bits
    (destination register): 3 bits
    (source register): 3 bits

    is the format of register-to-register instructions;

    but memory-to-register instructions are load-store:

    0: 1 bit
    (first two bits of opcode: 00, 01, or 10 but not 11): 2 bits
    (remainder of load/store opcode): 3 bits
    (base register): 2 bits
    (mode: 11): 2 bits
    (destination register): 3 bits
    (index register): 3 bits

    (displacement): 16 bits

    If the index register is zero, the instruction refers to memory, but
    is not indexed, as usual.

    An almost complete instruction set, using 3/8 of the available opcode
    space. Subroutine call and branch instructions, of course, are still
    also needed.
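    As a sanity check on the field widths, the two formats above can be
    sketched as encoder functions. This is a hypothetical sketch: the
    function names and the halfword-before-displacement ordering are my
    assumptions; only the bit widths and field order come from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Register-to-register format (16 bits total):
   [15]=0, [14:13]=opcode hi (00/01/10, not 11), [12:8]=opcode lo,
   [7:6]=mode (not 11), [5:3]=destination reg, [2:0]=source reg */
static uint16_t encode_rr(unsigned op_hi, unsigned op_lo, unsigned mode,
                          unsigned rd, unsigned rs)
{
    assert(op_hi < 3 && op_lo < 32 && mode < 3 && rd < 8 && rs < 8);
    return (uint16_t)((op_hi << 13) | (op_lo << 8) |
                      (mode << 6) | (rd << 3) | rs);
}

/* Memory (load-store) format, 16-bit halfword plus 16-bit displacement:
   [15]=0, [14:13]=opcode hi, [12:10]=load/store opcode, [9:8]=base reg,
   [7:6]=mode (11), [5:3]=destination reg, [2:0]=index reg.
   Index register 0 means "not indexed", as the post notes. */
static uint32_t encode_mem(unsigned op_hi, unsigned ls_op, unsigned rb,
                           unsigned rd, unsigned rx, uint16_t disp)
{
    assert(op_hi < 3 && ls_op < 8 && rb < 4 && rd < 8 && rx < 8);
    uint16_t hw = (uint16_t)((op_hi << 13) | (ls_op << 10) | (rb << 8) |
                             (3u << 6) | (rd << 3) | rx);
    return ((uint32_t)hw << 16) | disp;
}
```

    The asserts encode the "not 11" restrictions on the opcode-hi and mode
    fields; mode 11 is what selects the memory format.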

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 17 21:50:26 2024
    From Newsgroup: comp.arch

    John Savard <quadibloc@servername.invalid> writes:
    Not that I expect Mitch Alsup to approve!

    The 6600 had several I/O processors with a 12-bit word length that
    were really one processor, basically using SMT.

    Well, if I have a processor with an ISA that involves register banks
    of 32 registers each... an alternate instruction set involving
    register banks of 8 registers each would let me allocate either one
    compute thread or four threads with the I/O processor instruction set.

    And what would the I/O processor instruction set look like?

    On the Burroughs B4900, it looked a lot like an 8085. In fact,
    it was an 8085.


    Think of the PDP-11 or the 9900 but give more importance to
    floating-point.

    Why on earth would an I/O processor use floating point?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 17 23:32:20 2024
    From Newsgroup: comp.arch

    While I much admire CDC 6600 PPs and how much work those puppies did
    allowing the big number crunchers to <well> crunch numbers::

    With modern technology allowing 32-128 CPUs on a single die--there is
    no reason to limit the width of a PP to 12-bits (1965:: yes there was
    ample reason:: 2024 no reason whatsoever.) There is little reason to
    even do 32-bit PPs when it cost so little more to get a 64-bit core.

    In 2005-6 I was looking into a Verilog full x86-64 core {less FP} so
    that those smaller CPUs could run ISRs and kernel codes to offload the
    big CPUs from I/O duties. Done in Verilog meant anyone could compile it
    onto another die so the I/O CPUs were out on the PCIe tree nanoseconds
    away from the peripherals rather than microseconds away. Close enough
    to perform the DMA activities on behalf of the devices; and consuming
    interrupts so the bigger cores did not see any of them (except timer).

    As Scott stated:: there does not seem to be any reason to need FP on a
    core only doing I/O and kernel queueing services.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 17 21:14:39 2024
    From Newsgroup: comp.arch

    On Wed, 17 Apr 2024 23:32:20 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:


    With modern technology allowing 32-128 CPUs on a single die--there is
    no reason to limit the width of a PP to 12-bits (1965:: yes there was
    ample reason:: 2024 no reason whatsoever.) There is little reason to
    even do 32-bit PPs when it cost so little more to get a 64-bit core.

    Well, I'm not. The PP instruction set I propose uses 16-bit and 32-bit
    instructions, and so uses the same bus as the main instruction set.

    As Scott stated:: there does not seem to be any reason to need FP on a
    core only doing I/O and kernel queueing services.

    That's true.

    This isn't about cores, though. Instead, a core running the main ISA
    of the processor will simply have the option to replace one
    regular-ISA thread by four threads which use 8 registers instead of
    32, allowing SMT with more threads.

    So we're talking about the same core. The additional threads will get
    to execute instructions 1/4 as often as regular threads, so their
    performance is reduced, matching an ISA that gives them fewer
    registers.

    Since the design is reminiscent of the 6600 PPs, these threads might
    be used for I/O tasks, but nothing stops them from being used for
    other purposes for which access to the FP capabilities of the chip may
    be relevant.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Thu Apr 18 03:34:31 2024
    From Newsgroup: comp.arch

    On Wed, 17 Apr 2024 15:19:03 -0600, John Savard wrote:

    The 6600 had several I/O processors with a 12-bit word length that
    were really one processor, basically using SMT.

    Originally these “PPUs” (“Peripheral Processor Units”) were for running
    the OS, while the main CPU was primarily dedicated to running user
    programs.

    Apparently this idea did not work out so well, and in later versions
    of the OS, more code ran on the CPU instead of the PPUs.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 18 16:55:37 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Wed, 17 Apr 2024 23:32:20 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:


    With modern technology allowing 32-128 CPUs on a single die--there is
    no reason to limit the width of a PP to 12-bits (1965:: yes there was
    ample reason:: 2024 no reason whatsoever.) There is little reason to
    even do 32-bit PPs when it cost so little more to get a 64-bit core.

    Well, I'm not. The PP instruction set I propose uses 16-bit and 32-bit
    instructions, and so uses the same bus as the main instruction set.

    As Scott stated:: there does not seem to be any reason to need FP on a
    core only doing I/O and kernel queueing services.

    That's true.

    This isn't about cores, though. Instead, a core running the main ISA
    of the processor will simply have the option to replace one
    regular-ISA thread by four threads which use 8 registers instead of
    32, allowing SMT with more threads.

    The hard thing is to run the Operating System in the PPs using the
    same compiled code in either a big core or in a little core. The big
    cores are on a CPU centric die, the little ones out on device oriented
    dies.
    In 7nm a MIPS R2000 is less than 0.07mm^2 using std cells. At this size
    every device can have its own core.

    So we're talking about the same core. The additional threads will get
    to execute instructions 1/4 as often as regular threads, so their
    performance is reduced, matching an ISA that gives them fewer
    registers.

    I knew you were talking about it that way, I was trying to get you to
    change your mind and use the same ISA in the device cores as you use
    in the CPU cores so you can run the same OS code and even a bit of the
    device drivers as well.

    Since the design is reminiscent of the 6600 PPs, these threads might
    be used for I/O tasks, but nothing stops them from being used for
    other purposes for which access to the FP capabilities of the chip may
    be relevant.

    Yes, exactly, and it is for those other purposes that you want these
    device cores to operate on the same ISA as the big cores. This way if
    anything goes wrong, you can simply lob the code back to a CPU centric
    core and finish the job.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 18 16:59:18 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro wrote:

    On Wed, 17 Apr 2024 15:19:03 -0600, John Savard wrote:

    The 6600 had several I/O processors with a 12-bit word length that were
    really one processor, basically using SMT.

    Originally these “PPUs” (“Peripheral Processor Units”) were for running
    the OS,

    Including polling DMA performed by the PPs.

    while the main CPU was primarily dedicated to running user
    programs.

    Apparently this idea did not work out so well, and in later versions
    of the OS, more code ran on the CPU instead of the PPUs.

    Imagine that: 10×12-bit CPUs, running at 1/10 the frequency of the
    main CPU, having a hard time performing OS workloads while the 50×
    faster CPU cores perform user workloads.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Thu Apr 18 23:42:15 2024
    From Newsgroup: comp.arch

    On Thu, 18 Apr 2024 16:55:37 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Yes, exactly, and it is for those other purposes that you want these
    device cores to operate on the same ISA as the big cores. This way if
    anything goes wrong, you can simply lob the code back to a CPU centric
    core and finish the job.

    If the design has P-cores and E-cores, both will have the same *pair*
    of ISAs.

    Code written in the big ISA will run on both kinds of core, and code
    written in the little ISA will also run on both kinds of core, but use
    less resources on whichever core it is placed.

    So I won't have _that_ problem.

    Each core can just switch between compute duty with N threads, and I/O
    service duty with 4*N threads - or anywhere in between.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Fri Apr 19 01:38:45 2024
    From Newsgroup: comp.arch

    On Thu, 18 Apr 2024 23:42:15 -0600, John Savard
    <quadibloc@servername.invalid> wrote:


    Each core can just switch between compute duty with N threads, and
    I/O service duty with 4*N threads - or anywhere in between.

    So I hope it is clear now I'm talking about SMT threads, not cores.
    Threads are orthogonal to cores.

    But I did make one oversimplification that could be confusing.

    The full instruction set assumes banks of 32 registers, one each for
    integer and floats; the reduced instruction set assumes banks of 8
    registers, one each for integer and floats.

    So one thread of the full ISA can be replaced by four threads of the
    reduced ISA; both use the same number of registers.
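    One way to picture the sharing (my own sketch; the post fixes the
    register counts but not the mapping): the four reduced-ISA threads
    tile the same 32-entry bank that a single full-ISA thread would use.

```c
#include <assert.h>

/* Hypothetical mapping of reduced-ISA architectural registers onto the
   32-entry bank: thread t (0..3), register r (0..7) -> bank entry.
   Four disjoint 8-register windows exactly fill the bank. */
static unsigned bank_index(unsigned t, unsigned r)
{
    assert(t < 4 && r < 8);
    return t * 8 + r;
}
```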

    That's all right for an in-order design. But in real life, computers
    are out-of-order. So the *rename* registers would have to be split up.

    Since the reduced ISA threads are four times greater in number, their
    instructions have four times longer to finish executing before their
    thread gets a chance to execute again. So presumably reduced ISA
    threads will need less aggressive OoO, and 1/4 the rename registers
    might be adequate, but there's obviously no guarantee that this would
    indeed be an ideal fit.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 19 18:40:45 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Thu, 18 Apr 2024 23:42:15 -0600, John Savard <quadibloc@servername.invalid> wrote:


    Each core can just switch between compute duty with N threads, and
    I/O service duty with 4*N threads - or anywhere in between.

    So I hope it is clear now I'm talking about SMT threads, not cores.
    Threads are orthogonal to cores.

    That was already clear.

    But I did make one oversimplification that could be confusing.

    The full instruction set assumes banks of 32 registers, one each for
    integer and floats; the reduced instruction set assumes banks of 8
    registers, one each for integer and floats.

    So one thread of the full ISA can be replaced by four threads of the
    reduced ISA; both use the same number of registers.

    So how does a 32-register thread "call" an 8-register thread ?? or vice
    versa ??

    What ABI model does the compiler use ??

    When an 8-register thread takes an exception, is it handled by an 8-reg
    thread or a 32-register thread ??

    That's all right for an in-order design. But in real life, computers
    are out-of-order. So the *rename* registers would have to be split up.

    In K9 we unified the x86 register files into a single file to simplify
    HW maintenance of the OoO state.

    Since the reduced ISA threads are four times greater in number, their
    instructions have four times longer to finish executing before their
    thread gets a chance to execute again.

    Now all that forwarding logic is wasting its gates of delay and area
    without adding any performance.

    Now all those instruction schedulers are sitting around doing nothing.

    So presumably reduced ISA
    threads will need less aggressive OoO, and 1/4 the rename registers
    might be adequate, but there's obviously no guarantee that this would
    indeed be an ideal fit.

    LoL.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 20 01:06:33 2024
    From Newsgroup: comp.arch

    On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    So how does a 32-register thread "call" an 8-register thread ?? or vice
    versa ??

    That sort of thing would be done by supervisor mode instructions,
    similar to the ones used to start additional threads on a given core,
    or start threads on a new core.

    Since the lightweight ISA has the benefit of having fewer registers
    allocated, it's not the same as, say, a "thumb mode" which offers
    more compact code as its benefit. Instead, this is for use in classes
    of threads that are separate from ordinary code.

    I/O processing threads being one example of this.

    The intent of this kind of lightweight ISA is to reduce the temptation
    to decide "oh, we've got to put special smaller cores in the SoC/on
    the motherboard to perform this specialized task, because the main CPU
    is overkill". Because now you're using a smaller slice of the main
    CPU, so it's not a waste to do it there any more.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 20 01:09:53 2024
    From Newsgroup: comp.arch

    On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    So presumably reduced ISA
    threads will need less aggressive OoO, and 1/4 the rename registers
    might be adequate, but there's obviously no guarantee that this would
    indeed be an ideal fit.

    LoL.

    Well, yes. The fact that pretty much all serious high-performance
    designs these days _are_ OoO basically means that my brilliant idea is
    DoA.

    Of course, instead of replacing 1 full-ISA thread with 4 light-ISA
    threads, one could use a different number, based on what is optimum
    for a given implementation. But that ratio would now vary from one
    chip to another, being model-dependent.

    So it's not *totally* destroyed, but this is still a major blow.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 20 01:12:25 2024
    From Newsgroup: comp.arch

    On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    So presumably reduced ISA
    threads will need less aggressive OoO, and 1/4 the rename registers
    might be adequate, but there's obviously no guarantee that this would
    indeed be an ideal fit.

    LoL.

    Well, yes. The fact that pretty much all serious high-performance
    designs these days _are_ OoO basically means that my brilliant idea is
    DoA.

    Of course, instead of replacing 1 full-ISA thread with 4 light-ISA
    threads, one could use a different number, based on what is optimum
    for a given implementation. But that ratio would now vary from one
    chip to another, being model-dependent.

    So it's not *totally* destroyed, but this is still a major blow.

    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 20 17:07:11 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sat, 20 Apr 2024 01:09:53 -0600, John Savard <quadibloc@servername.invalid> wrote:


    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 20 15:14:07 2024
    From Newsgroup: comp.arch

    On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
    John Savard wrote:

    On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
    <quadibloc@servername.invalid> wrote:


    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!


    Seems about right.
    Seems like a whole lot of flailing with designs that seem needlessly complicated...



    Meanwhile, has looked around and noted:
    In some ways, RISC-V is sort of like MIPS with the field order reversed,
    and (ironically) actually smaller immediate fields (MIPS was using a lot
    of Imm16 fields, whereas RISC-V mostly used Imm12).

    But, seemed to have more wonk:
    A mode with 32x 32-bit GPRs;
    A mode with 32x 64-bit GPRs;
    Apparently a mode with 32x 32-bit GPRs that can be paired to 16x 64-bits
    as needed for 64-bit operations?...
    Integer operations (on 64-bit registers) that give UB or trap if values
    are outside of signed Int32 range;
    Other operations that sign-extend the values but are ironically called
    "unsigned" (apparently, similar wonk to RISC-V by having sign-extended
    Unsigned Int);
    Branch operations are bit-sliced;
    ...


    I had preferred a different strategy in some areas:
    Assume non-trapping operations by default;
    Sign-extend signed values, zero-extend unsigned values.

    Though, this is partly the source of some operations in my case assuming
    33-bit sign-extended: This can represent both the signed and unsigned
    32-bit ranges.
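    The 33-bit claim is easy to check: a 33-bit signed value covers
    [-2^32, 2^32-1], which contains both the int32 and uint32 ranges
    (a small sketch of mine, not from the post):

```c
#include <assert.h>
#include <stdint.h>

/* True if v fits in a 33-bit sign-extended value, i.e. within
   [-2^32, 2^32 - 1]. Both int32_t and uint32_t ranges qualify. */
static int fits_33(int64_t v)
{
    return v >= -(INT64_C(1) << 32) && v < (INT64_C(1) << 32);
}
```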

    One could argue that sign-extending both could save 1 bit in some cases.
    But, this creates wonk in other cases, such as requiring an explicit
    zero extension for "unsigned int" to "long long" casts; and more cases
    where separate instructions are needed for Int32 and Int64 cases (say,
    for example, RISC-V needed around 4x as many Int<->Float conversion
    operators due to its design choices in this area).

    Say:
    RV64:
    Int32<->Binary32, UInt32<->Binary32
    Int64<->Binary32, UInt64<->Binary32
    Int32<->Binary64, UInt32<->Binary64
    Int64<->Binary64, UInt64<->Binary64
    BJX2:
    Int64<->Binary64, UInt64<->Binary64

    With the Uint64 case mostly added because otherwise one needs a wonky
    edge case to deal with this (but is rare in practice).

    The separate 32-bit cases were avoided by tending to normalize
    everything to Binary64 in registers (with Binary32 only existing in SIMD
    form or in memory).

    Annoyingly, I did end up needing to add logic for all of these cases to
    deal with RV64G.


    Currently no plans to implement RISC-V's Privileged ISA stuff, mostly
    because it would likely be unreasonably expensive. It is in theory
    possible to write an OS to run in RISC-V mode, but it would need to deal
    with the different OS level and hardware-level interfaces (in much the
    same way, as I needed to use a custom linker script for GCC, as my stuff
    uses a different memory map from the one GCC had assumed; namely that of
    RAM starting at the 64K mark, rather than at the 16MB mark).



    In some cases in my case, there are distinctions between 32-bit and
    64-bit compare-and-branch ops. I am left thinking this distinction may
    be unnecessary, and one may only need 64-bit compare and branch.

    In the emulator, the current difference ended up mostly that the 32-bit version sees if the 32-bit and 64-bit version would give a different
    result and faulting if so, since this generally means that there is a
    bug elsewhere (such as other code is producing out-of-range values).

    For a few newer cases (such as the 3R compare ops, which produce a 1-bit output in a register), had only defined 64-bit versions.


    One could just ignore the distinction between 32- and 64-bit compare
    in hardware, but had still burnt the encoding space on this. In a new
    ISA design, I would likely drop the existence of 32-bit compare and
    use exclusively 64-bit compare.


    In many cases, the distinction between 32-bit and 64-bit operations, or between 2R and 3R cases, had ended up less significant than originally
    thought (and now have ended up gradually deprecating and disabling some
    of the 32-bit 2R encodings mostly due to "lack of relevance").

    Though, admittedly, part of the reason for a lot of separate 2R cases
    existing was that I had initially had the impression that there may have
    been a performance cost difference between 2R and 3R instructions. This
    ended up not really the case, as the various units ended up typically
    using 3R internally anyways.


    So, say, one needs an ALU with, say:
    2 inputs, one output;
    Ability to bit-invert the second input
    along with inverting carry-in, ...
    Ability to sign or zero extend the output.
    So, say, operations:
    ADD / SUB (Add, 64-bit)
    ADDSL / SUBSL (Add, 32-bit, sign extend)
    ADDUL / SUBUL (Add, 32-bit, zero extend)
    AND
    OR
    XOR
    CMPEQ
    CMPNE
    CMPGT (CMPLT implicit)
    CMPGE (CMPLE implicit)
    CMPHI (unsigned GT)
    CMPHS (unsigned GE)
    ...

    Where, internally compare works by performing a subtract and then
    producing a result based on some status bits (Z,C,S,O). As I see it,
    ideally these bits should not be exposed at the ISA level though (much
    pain and hair results from the existence of architecturally visible ALU status-flag bits).
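    That subtract-then-derive scheme can be sketched as follows. The flag
    derivations are the conventional ones; the post names the bits
    (Z,C,S,O) but not the formulas, so treat this as illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Status bits produced by a 64-bit subtract a - b. */
typedef struct { int z, c, s, o; } flags_t;

static flags_t sub_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    flags_t f;
    f.z = (r == 0);                  /* zero */
    f.c = (a < b);                   /* carry, as borrow-out */
    f.s = (int)(r >> 63);            /* sign of the result */
    /* signed overflow: operand signs differ and result sign != a's */
    f.o = (int)((((a ^ b) & (a ^ r)) >> 63) & 1);
    return f;
}

static int cmpgt(int64_t a, int64_t b)   /* signed a > b: !Z && S==O */
{
    flags_t f = sub_flags((uint64_t)a, (uint64_t)b);
    return !f.z && (f.s == f.o);
}

static int cmphi(uint64_t a, uint64_t b) /* unsigned a > b: !Z && !C */
{
    flags_t f = sub_flags(a, b);
    return !f.z && !f.c;
}
```

    The point of the post stands either way: the flags exist transiently
    inside the ALU, without being architecturally visible state.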


    Some other features could still be debated though, along with how much simplification could be possible.

    If I did a new design, would probably still keep predication and jumbo prefixes.

    Explicit bundling vs superscalar could be argued either way, as
    superscalar isn't as expensive as initially thought, but in a simpler
    form is comparably weak (the compiler has an advantage that it can
    invest more expensive analysis into this, reorder instructions, etc; but
    this only goes so far as the compiler understands the CPU's pipeline,
    ties the code to a specific pipeline structure, and becomes effectively
    moot with OoO CPU designs).

    So, a case could be made that a "general use" ISA be designed without
    the use of explicit bundling. In my case, using the bundle flags also
    requires the code to use an instruction to signal to the CPU what configuration of pipeline it expects to run on, with the CPU able to
    fall back to scalar (or superscalar) execution if it does not match.

    For the most part, thus far nearly everything has ended up as "Mode 2", namely:
    3 lanes;
    Lane 1 does everything;
    Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
    Lane 3 only does Basic ALU ops and a few CONV ops and similar.
    Lane 3 originally also did Shift, dropped to reduce cost.
    Mem ops may eat Lane 3, ...

    Where, say:
    Mode 0 (Default):
    Only scalar code is allowed, CPU may use superscalar (if available).
    Mode 1:
    2 lanes:
    Lane 1 does everything;
    Lane 2 does ALU, Shift, and CONV.
    Mem ops take up both lanes.
    Effectively scalar for Load/Store.
    Later defined that 128-bit MOV.X is allowed in a Mode 1 core.

    Had defined wider modes, and ones that allow dual-lane IO and FPU instructions, but these haven't seen use (too expensive to support in hardware).

    Had ended up with the ambiguous "extension" to the Mode 2 rules of
    allowing an FPU instruction to be executed from Lane 2 if there was not
    an FPU instruction in Lane 1, or allowing co-issuing certain FPU
    instructions if they effectively combine into a corresponding SIMD op.

    In my current configurations, there is only a single memory access port.
    A second memory access port would help with performance, but is
    comparably a rather expensive feature (and doesn't help enough to
    justify its fairly steep cost).

    For lower-end cores, a case could be made for assuming a 1-wide CPU with
    a 2R1W register file, but designing the whole ISA around this limitation
    and not allowing for anything more is limiting (and mildly detrimental
    to performance). If we can assume cores with an FPU, we can probably
    also assume cores with more than two register read ports available.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 20 22:03:21 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
    John Savard wrote:

    On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
    <quadibloc@servername.invalid> wrote:


    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!


    Seems about right.
    Seems like a whole lot of flailing with designs that seem needlessly complicated...



    Meanwhile, has looked around and noted:
    In some ways, RISC-V is sort of like MIPS with the field order reversed,

    They, in effect, Little-Endian-ed the fields.

    and (ironically) actually smaller immediate fields (MIPS was using a lot
    of Imm16 fields, whereas RISC-V mostly used Imm12).

    Yes, RISC-V took a step back with the 12-bit immediates. My 66000, on
    the other hand, only has 12-bit immediates for shift instructions--
    allowing all shifts to reside in one Major OpCode; the rest inst[31]=1
    have 16-bit immediates (universally sign extended).

    But, seemed to have more wonk:
    A mode with 32x 32-bit GPRs; // unnecessary
    A mode with 32x 64-bit GPRs;
    Apparently a mode with 32x 32-bit GPRs that can be paired to 16x 64-bits
    as needed for 64-bit operations?...

    Repeating the mistake I made on Mc 88100....

    Integer operations (on 64-bit registers) that give UB or trap if values
    are outside of signed Int32 range;

    Isn't it just wonderful ??

    Other operations that sign-extend the values but are ironically called
    "unsigned" (apparently, similar wonk to RISC-V by having sign-extended
    Unsigned Int);
    Branch operations are bit-sliced;
    ....


    I had preferred a different strategy in some areas:
    Assume non-trapping operations by default;

    Assume trap/"do the expected thing" under a user accessible flag.

    Sign-extend signed values, zero-extend unsigned values.

    Another mistake I made in Mc 88100.

    Do you sign extend the 16-bit displacement on an unsigned LD ??

    Though, this is partly the source of some operations in my case assuming
    33-bit sign-extended: This can represent both the signed and unsigned
    32-bit ranges.

    These are some of the reasons My 66000 is 64-bit register/calculation only.

    One could argue that sign-extending both could save 1 bit in some cases. But, this creates wonk in other cases, such as requiring an explicit
    zero extension for "unsigned int" to "long long" casts; and more cases
    where separate instructions are needed for Int32 and Int64 cases (say,
    for example, RISC-V needed around 4x as many Int<->Float conversion operators due to its design choices in this area).

    It also gets difficult when you consider EADD Rd,Rdouble,Rexponent ??
    is it a FP calculation or an integer calculation ?? If Rdouble is a
    constant is the constant FP or int, if Rexponent is a constant is it
    double or int,..... Does it raise FP overflow or integer overflow ??

    Say:
    RV64:
    Int32<->Binary32, UInt32<->Binary32
    Int64<->Binary32, UInt64<->Binary32
    Int32<->Binary64, UInt32<->Binary64
    Int64<->Binary64, UInt64<->Binary64
    BJX2:
    Int64<->Binary64, UInt64<->Binary64
    My 66000:
    int64_t -> { uint64_t, float, double }
    uint64_t -> { int64_t, float, double }
    float -> { uint64_t, int64_t, double }
    double -> { uint64_t, int64_t, float }

    With the Uint64 case mostly added because otherwise one needs a wonky
    edge case to deal with this (but is rare in practice).

    The separate 32-bit cases were avoided by tending to normalize
    everything to Binary64 in registers (with Binary32 only existing in SIMD form or in memory).

    I saved LD and ST instructions by leaving float 32-bits in the registers.

    Annoyingly, I did end up needing to add logic for all of these cases to
    deal with RV64G.

    No rest for the wicked.....

    Currently no plans to implement RISC-V's Privileged ISA stuff, mostly because it would likely be unreasonably expensive.

    The sea of control registers or the sequencing model applied thereon ??
    My 66000 allows access to all control registers via memory mapped I/O
    space.

    It is in theory
    possible to write an OS to run in RISC-V mode, but it would need to deal with the different OS level and hardware-level interfaces (in much the
    same way, as I needed to use a custom linker script for GCC, as my stuff uses a different memory map from the one GCC had assumed; namely that of
    RAM starting at the 64K mark, rather than at the 16MB mark).



    In some cases in my case, there are distinctions between 32-bit and
    64-bit compare-and-branch ops. I am left thinking this distinction may
    be unnecessary, and one may only need 64-bit compare and branch.

    No 32-bit stuff, thereby no 32-bit distinctions needed.

    In the emulator, the current difference ended up mostly that the 32-bit version sees if the 32-bit and 64-bit version would give a different
    result and faulting if so, since this generally means that there is a
    bug elsewhere (such as other code is producing out-of-range values).

    Saving vast amounts of power {{{not}}}

    For a few newer cases (such as the 3R compare ops, which produce a 1-bit output in a register), had only defined 64-bit versions.

    Oh what a tangled web we.......

    One could just ignore the distinction between 32 and 64 bit compare in hardware, but had still burnt the encoding space on this. In a new ISA design, I would likely drop the existence of 32-bit compare and use exclusively 64-bit compare.


    In many cases, the distinction between 32-bit and 64-bit operations, or between 2R and 3R cases, had ended up less significant than originally thought (and now have ended up gradually deprecating and disabling some
    of the 32-bit 2R encodings mostly due to "lack of relevance").

    I deprecated all of them.

    Though, admittedly, part of the reason for a lot of separate 2R cases existing was that I had initially had the impression that there may have been a performance cost difference between 2R and 3R instructions. This ended up not really the case, as the various units ended up typically
    using 3R internally anyways.


    So, say, one needs an ALU with, say:
    2 inputs, one output;
    you forgot carry, and inversion to perform subtraction.
    Ability to bit-invert the second input
    along with inverting carry-in, ...
    Ability to sign or zero extend the output.

    So, My 66000 integer adder has 3 carry inputs, and I discovered a way to perform these that takes no more gates of delay than the typical 1-carry
    in 64-bit integer adder. This gives me a = -b -c; for free.

    So, say, operations:
    ADD / SUB (Add, 64-bit)
    ADDSL / SUBSL (Add, 32-bit, sign extend) // nope
    ADDUL / SUBUL (Add, 32-bit, zero extend) // nope
    AND
    OR
    XOR
    CMPEQ // 1 ICMP inst
    CMPNE
    CMPGT (CMPLT implicit)
    CMPGE (CMPLE implicit)
    CMPHI (unsigned GT)
    CMPHS (unsigned GE)
    ....

    Where, internally compare works by performing a subtract and then
    producing a result based on some status bits (Z,C,S,O). As I see it,
    ideally these bits should not be exposed at the ISA level though (much
    pain and hair results from the existence of architecturally visible ALU status-flag bits).

    I agree that these flags should not be exposed through ISA; and I did not.
    On the other hand multi-precision arithmetic demands at least carry {or
    some other means which is even more powerful--such as CARRY.....}

    Some other features could still be debated though, along with how much simplification could be possible.

    If I did a new design, would probably still keep predication and jumbo prefixes.

    I kept predication but not the way most predication works.
    My work on Mc 88120 and K9 taught me the futility of things in the
    instruction stream that provide artificial boundaries. I have a suspicion
    that if you have an FPGA capable of allowing you to build an 8-wide
    machine, you would do the jumbo stuff differently, too.

    Explicit bundling vs superscalar could be argued either way, as
    superscalar isn't as expensive as initially thought, but in a simpler
    form is comparably weak (the compiler has an advantage that it can
    invest more expensive analysis into this, reorder instructions, etc; but this only goes so far as the compiler understands the CPU's pipeline,

    Compilers are notoriously unable to outguess a good branch predictor.

    ties the code to a specific pipeline structure, and becomes effectively
    moot with OoO CPU designs).

    OoO exists, in a practical sense, to abstract the pipeline out of the compiler; or conversely, to allow multiple implementations to run the
    same compiled code optimally on each implementation.

    So, a case could be made that a "general use" ISA be designed without
    the use of explicit bundling. In my case, using the bundle flags also requires the code to use an instruction to signal to the CPU what configuration of pipeline it expects to run on, with the CPU able to
    fall back to scalar (or superscalar) execution if it does not match.

    Sounds like a bridge too far for your 8-wide GBOoO machine.

    For the most part, thus far nearly everything has ended up as "Mode 2", namely:
    3 lanes;
    Lane 1 does everything;
    Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
    Lane 3 only does Basic ALU ops and a few CONV ops and similar.
    Lane 3 originally also did Shift, dropped to reduce cost.
    Mem ops may eat Lane 3, ...

    Try 6-lanes:
    1,2,3 Memory ops + integer ADD and Shifts
    4 FADD ops + integer ADD and FMisc
    5 FMAC ops + integer ADD
    6 CMP-BR ops + integer ADD

    Where, say:
    Mode 0 (Default):
    Only scalar code is allowed, CPU may use superscalar (if available).
    Mode 1:
    2 lanes:
    Lane 1 does everything;
    Lane 2 does ALU, Shift, and CONV.
    Mem ops take up both lanes.
    Effectively scalar for Load/Store.
    Later defined that 128-bit MOV.X is allowed in a Mode 1 core.
    Modeless.

    Had defined wider modes, and ones that allow dual-lane IO and FPU instructions, but these haven't seen use (too expensive to support in hardware).

    Had ended up with the ambiguous "extension" to the Mode 2 rules of
    allowing an FPU instruction to be executed from Lane 2 if there was not
    an FPU instruction in Lane 1, or allowing co-issuing certain FPU instructions if they effectively combine into a corresponding SIMD op.

    In my current configurations, there is only a single memory access port.

    This should imply that your 3-wide pipeline is running at 90%-95%
    memory/cache saturation.

    A second memory access port would help with performance, but is
    comparably a rather expensive feature (and doesn't help enough to
    justify its fairly steep cost).

    For lower-end cores, a case could be made for assuming a 1-wide CPU with
    a 2R1W register file, but designing the whole ISA around this limitation
    and not allowing for anything more is limiting (and mildly detrimental
    to performance). If we can assume cores with an FPU, we can probably
    also assume cores with more than two register read ports available.

    If you design around the notion of a 3R1W register file, FMAC and INSERT
    fall out of the encoding easily. Done right, one can switch it into a 4R
    or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 20 17:59:12 2024
    From Newsgroup: comp.arch

    On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    BGB wrote:

    Sign-extend signed values, zero-extend unsigned values.

    Another mistake I made in Mc 88100.

    As that is a mistake the IBM 360 made, I make it too. But I make it
    the way the 360 did: there are no signed and unsigned values, in the
    sense of a Burroughs machine, there are just Load, Load Unsigned - and
    Insert - instructions.

    Index and base register values are assumed to be unsigned.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 20 18:01:49 2024
    From Newsgroup: comp.arch

    On Sat, 20 Apr 2024 01:06:33 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    So how does a 32-register thread "call" an 8 register thread ?? or vice versa ??

    That sort of thing would be done by supervisor mode instructions,
    similar to the ones used to start additional threads on a given core,
    or start threads on a new core.

    Since the lightweight ISA has the benefit of having fewer registers allocated, it's not the same as, say, a "thumb mode" which offers
    more compact code as its benefit. Instead, this is for use in classes
    of threads that are separate from ordinary code.

    I/O processing threads being one example of this.

    Of course, though, there's nothing preventing using the lightweight
    ISA as the basis for something that _could_ interoperate with the full
    ISA. Keep all 32 registers in each bank, and have a sliding 8-register
    window, or use bundles of instructions, say up to seven instructions,
    using one of three groups of eight integer registers and one of four
    groups of floating-point registers. (The fourth group of integer
    registers is the base registers.)

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 20 18:06:22 2024
    From Newsgroup: comp.arch

    On Sat, 20 Apr 2024 17:59:12 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    BGB wrote:

    Sign-extend signed values, zero-extend unsigned values.

    Another mistake I made in Mc 88100.

    As that is a mistake the IBM 360 made, I make it too. But I make it
    the way the 360 did: there are no signed and unsigned values, in the
    sense of a Burroughs machine, there are just Load, Load Unsigned - and
    Insert - instructions.

    Since there was only one set of arithmetic instructions, that meant
    that when you wrote code to operate on unsigned values, you had to
    remember that the normal names of the condition code values were
    oriented around signed arithmetic.

    So during unsigned arithmetic, "overflow" didn't _mean_ overflow.
    Instead, carry was overflow.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 21 00:43:21 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    BGB wrote:

    Sign-extend signed values, zero-extend unsigned values.

    Another mistake I made in Mc 88100.

    As that is a mistake the IBM 360 made, I make it too. But I make it
    the way the 360 did: there are no signed and unsigned values, in the
    sense of a Burroughs machine, there are just Load, Load Unsigned - and
    Insert - instructions.

    Index and base register values are assumed to be unsigned.

    I would use the term signless as opposed to unsigned.

    Address arithmetic is ADD only and does not care about signs or
    overflow. There is no concept of a negative base register or a
    negative index register (or, for that matter, a negative displace-
    ment), overflow, underflow, carry, ...

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 21 02:17:39 2024
    From Newsgroup: comp.arch

    On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
    John Savard wrote:

    On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
    <quadibloc@servername.invalid> wrote:


    And, hey, I'm not the first guy to get sunk because of forgetting what lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!


    Seems about right.
    Seems like a whole lot of flailing with designs that seem needlessly
    complicated...



    Meanwhile, has looked around and noted:
    In some ways, RISC-V is sort of like MIPS with the field order reversed,

    They, in effect, Little-Endian-ed the fields.


    Yeah.


    and (ironically) actually smaller immediate fields (MIPS was using a
    lot of Imm16 fields, whereas RISC-V mostly used Imm12).

    Yes, RISC-V took a step back with the 12-bit immediates. My 66000, on
    the other hand, only has 12-bit immediates for shift instructions--
    allowing all shifts to reside in one Major OpCode; the rest inst[31]=1
    have 16-bit immediates (universally sign extended).


    I had gone further and used mostly 9/10 bit fields (mostly expanded to
    10/12 in XG2).


    I don't really think this is a bad choice in a statistical sense (as it
    so happens, most of the immediate values can fit into a 9-bit field,
    without going too far into "diminishing returns" territory).

    Ended up with some inconsistency when expanding to 10 bits:
    Displacements went 9u -> 10s
    ADD/SUB: 9u/9n -> 10u/10n
    AND: 9u -> 10s
    OR,XOR: 9u -> 10u

    And was initially 9u->10u (like OR and XOR), but changed over at the
    last minute:
    Negative masks were far more common than 10-bit masks;
    At the moment, the change didn't seem to break anything;
    I didn't really have any other encoding space to put this.
    The main "sane" location to put it was already taken by RSUB;
    The Imm9 space is basically already full.

    With OR and XOR, negative masks are essentially absent, so switching
    these to signed would not make sense; even if this breaks the symmetry
    between AND/OR/XOR.


    But, seemed to have more wonk:
    A mode with 32x 32-bit GPRs; // unnecessary
    A mode with 32x 64-bit GPRs;
    Apparently a mode with 32x 32-bit GPRs that can be paired to 16x
    64-bits as needed for 64-bit operations?...

    Repeating the mistake I made on Mc 88100....


    I had seen a video talking about the Nintendo 64, and it was saying that
    the 2x paired 32-bit register mode was used more often than the native
    64-bit mode, as the native 64-bit mode was slower as apparently it
    couldn't fully pipeline the 64-bit ops, so using it in this mode came at
    a performance hit (vs using it to run glorified 32-bit code).


    Integer operations (on 64-bit registers) that give UB or trap if
    values are outside of signed Int32 range;

    Isn't it just wonderful ??


    No direct equivalent in my case, nor any desire to add these.

    Preferable, I think, if the behavior of instructions is consistent across implementations; though OTOH I can't claim a strict 1:1 match between my Verilog implementation and emulator, at least I try to keep things consistent.

    Though, things fall short of strict 100% consistency between the Verilog implementation and emulator (usually in cases where the emulator will
    trap, but the Verilog implementation will "do whatever").

    Though, in part, this is because the emulator serves the secondary
    purpose of linting the compiler output.

    Though, partly it is a case of, not even trapping is entirely free.


    Other operations that sign-extend the values but are ironically called
    "unsigned" (apparently, similar wonk to RISC-V by having
    sign-extended Unsigned Int);
    Branch operations are bit-sliced;
    ....


    I had preferred a different strategy in some areas:
       Assume non-trapping operations by default;

    Assume trap/"do the expected thing" under a user accessible flag.

    Most are defined in ways that I feel are sensible.

    For ALU this means one of:
    64-bit result;
    Sign-extended from 32-bit result;
    Zero extended from 32-bit result.
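
    The three result shapes above can be sketched directly in C; the function names here are illustrative, not taken from any of the ISAs discussed:

    ```c
    #include <stdint.h>

    /* Full 64-bit result. */
    uint64_t add64(uint64_t a, uint64_t b) { return a + b; }

    /* 32-bit result, sign-extended into the 64-bit register. */
    uint64_t add32s(uint64_t a, uint64_t b) {
        return (uint64_t)(int64_t)(int32_t)(uint32_t)(a + b);
    }

    /* 32-bit result, zero-extended into the 64-bit register. */
    uint64_t add32u(uint64_t a, uint64_t b) {
        return (uint64_t)(uint32_t)(a + b);
    }
    ```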


       Sign-extend signed values, zero-extend unsigned values.

    Another mistake I made in Mc 88100.

    Do you sign extend the 16-bit displacement on an unsigned LD ??


    In my case; for the Baseline encoding, Ld/St displacements were unsigned
    only.

    For XG2, they are signed. It was a tight call, but the sign-extended
    case won out by an admittedly thin margin in this case.

    Granted, this means that the Load/Store ops with a Disp5u/Disp6s
    encodings are mostly redundant in XG2, but are the only way to directly
    encode negative displacements in Baseline+XGPR (in pure Baseline,
    negative Ld/St displacements being N/E).


    But, as for values in registers, I personally feel that my scheme (as a
    direct extension of the scheme that C itself seems to use) works better
    than the one used by MIPS and RISC-V, which seems needlessly wonky with
    a bunch of edge cases (that end up ultimately requiring the ISA to care
    more about the size and type of the value rather than less).

    Then again, x86-64 and ARM64 went the other direction (always zero
    extending the 32-bit values).

    Then again, it seems like a case where spending more in one area can
    save cost in others.


    Though, this is partly the source of some operations in my case
    assuming 33 bit sign-extended: This can represent both the signed and
    unsigned 32-bit ranges.
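
    The 33-bit observation can be checked directly: the sign-extended range [-2^32, 2^32) contains both [INT32_MIN, INT32_MAX] and [0, UINT32_MAX]. A minimal sketch:

    ```c
    #include <stdint.h>
    #include <stdbool.h>

    /* True if v fits in a 33-bit sign-extended value. */
    bool fits_33bit(int64_t v) {
        return v >= -(1LL << 32) && v < (1LL << 32);
    }
    ```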

    These are some of the reasons My 66000 is 64-bit register/calculation only.


    It is a tradeoff.

    Many operations are full 64-bit.


    Load/Store and Branch displacements have tended to be 33 bit to save
    cost over 48 bit displacements (with a 48-bit address space, with
    16-bits for optional type-tags or similar).

    Though, this does theoretically add a penalty if "size_t" or "long" or
    similar is used as an array index (rather than "int" or smaller), since
    in this case the compiler will need to fall back to ALU operations to
    perform the index operation (similar to what typically happens for array indexing on RISC-V).

    Mostly not a huge issue, as pretty much all the code seems to use 'int'
    for array indices.


    Optionally, can enable the use of 48-bit displacements, but not really
    worth much if they are not being used (similar issue for the 96-bit
    addressing thing).


    Even 48-bits is overkill when one can fit the entirety of both RAM and secondary storage into the address space.



    Kind of a very different situation from 16-bit days, where people were engaging in wonk to try to fit in more RAM than they had address space...

    Well, nevermind a video where a guy managed to get a 486 PC working with
    no SIMM's, only apparently some on-board RAM on the MOBO, and some ISA RAM-expansion cards (apparently intended for the 80286).

    Apparently he was getting Doom framerates (in real-time) almost on-par
    with what I am seeing in Verilog simulations (roughly 11 seconds per
    frame at the moment; simulation running approximately 250x slower than real-time).


    One could argue that sign-extending both could save 1 bit in some
    cases. But, this creates wonk in other cases, such as requiring an
    explicit zero extension for "unsigned int" to "long long" casts; and
    more cases where separate instructions are needed for Int32 and Int64
    cases (say, for example, RISC-V needed around 4x as many Int<->Float
    conversion operators due to its design choices in this area).

    It also gets difficult when you consider EADD Rd,Rdouble,Rexponent ??
    is it a FP calculation or an integer calculation ?? If Rdouble is a
    constant is the constant FP or int, if Rexponent is a constant is it
    double or int,..... Does it raise FP overflow or integer overflow ??


    Dunno, neither RISC-V nor BJX2 has this...



    Say:
       RV64:
         Int32<->Binary32, UInt32<->Binary32
         Int64<->Binary32, UInt64<->Binary32
         Int32<->Binary64, UInt32<->Binary64
         Int64<->Binary64, UInt64<->Binary64
       BJX2:
         Int64<->Binary64, UInt64<->Binary64
        My 66000:
          int64_t  -> { uint64_t, float,   double }
          uint64_t -> {  int64_t, float,   double }
          float    -> { uint64_t, int64_t, double }
          double   -> { uint64_t, int64_t, float  }


    I originally just had two instructions (FLDCI and FSTCI), but gave in and
    added more, because, say:
    MOV 0x8000000000000000, R3
    TEST.Q R4, R3
    SHLD.Q?F R4, -1, R4
    FLDCI R4, R2
    FADD?F R2, R2, R2

    Is more wonk than ideal...

    Technically, also the logic (for the unsigned variants) had already been
    added to the CPU core for sake of RV64G.

    Technically, the logic for Int32 and Binary32 cases exists, but I feel
    less incentive to add them for sake of:
    All the scalar math is being done as Binary64;
    With the sign/zero extension scheme, separate 32-bit forms don't add much.


    With the Uint64 case mostly added because otherwise one needs a wonky
    edge case to deal with this (but is rare in practice).

    The separate 32-bit cases were avoided by tending to normalize
    everything to Binary64 in registers (with Binary32 only existing in
    SIMD form or in memory).

    I saved LD and ST instructions by leaving float 32-bits in the registers.


    I had originally gone with just using a 32-bit load/store, along with a (Binary32<->Binary64) conversion instruction.

    Eventually went and added FMOV.S as a combined Load/Store and Convert,
    mostly because MOV.L+FLDCF was not ideal for performance in programs
    like Quake (where "float" is not exactly a rarely used type).


    On a low-cost core, most sensible option is to fall back to explicit
    convert (at least, on cores that can still afford an FPU).

    But, as I see it, LD/ST with built in convert is likely a cheaper option
    than making every other FPU related instruction need to deal with every floating-point format.
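
    The semantics of such a combined load-and-convert (in the spirit of the FMOV.S described above, though this is only an illustrative sketch, not the hardware):

    ```c
    #include <string.h>

    /* Fetch 32 bits from memory and widen to Binary64 on the way
       into the register, in a single operation. */
    double fmov_s_load(const void *mem) {
        float f;
        memcpy(&f, mem, sizeof f);   /* the MOV.L part: 32-bit access */
        return (double)f;            /* the FLDCF part: widen to Binary64 */
    }
    ```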


    Annoyingly, I did end up needing to add logic for all of these cases
    to deal with RV64G.

    No rest for the wicked.....


    It is a bit wonky, as I dealt with the scalar Binary32 ops for RV mostly
    by routing them through the logic for the SIMD ops. At least as far as
    most code should be concerned, it is basically the same (even if it does technically deviate from the RV64 spec, which defines the high bits of
    the register as encoding a NaN).

    Technically does allow for a very lazy FP-SIMD extension to RV64G
    (doesn't even require adding any new instructions...).



    Currently no plans to implement RISC-V's Privileged ISA stuff, mostly
    because it would likely be unreasonably expensive.

    The sea of control registers or the sequencing model applied thereon ??
    My 66000 allows access to all control registers via memory mapped I/O
    space.


    That, and the need for 3+ copies of the register file (for each
    operating mode), and the need for a hardware page-table walker, ...



                                                        It is in theory
    possible to write an OS to run in RISC-V mode, but it would need to
    deal with the different OS level and hardware-level interfaces (in
    much the same way, as I needed to use a custom linker script for GCC,
    as my stuff uses a different memory map from the one GCC had assumed;
    namely that of RAM starting at the 64K mark, rather than at the 16MB
    mark).



    In some cases in my case, there are distinctions between 32-bit and
    64-bit compare-and-branch ops. I am left thinking this distinction may
    be unnecessary, and one may only need 64 bit compare and branch.

    No 32-bit stuff, thereby no 32-bit distinctions needed.

    In the emulator, the current difference ended up mostly being that the
    32-bit version checks whether the 32-bit and 64-bit versions would give
    a different result, faulting if so, since this generally means that
    there is a bug elsewhere (such as other code producing out-of-range
    values).

    Saving vast amounts of power {{{not}}}


    For the Verilog version, option is more like:
    Just always do the 64-bit version.

    Nevermind that it has wasted 1 bit of encoding entropy on being able to specify 32 and 64 bit compare, in cases when it doesn't actually matter...

    It wasn't until some time later (after originally defining the
    encodings), that I started to realize how much it didn't matter.


    For a few newer cases (such as the 3R compare ops, which produce a
    1-bit output in a register), had only defined 64-bit versions.

    Oh what a tangled web we.......


    In "other ISA", these would be given different names:
    SLT, SLTU, SLTI, SLTIU, ...


    But, I ended up with:
    CMPQEQ Rm, Ro, Rn
    CMPQNE Rm, Ro, Rn
    CMPQGT Rm, Ro, Rn
    CMPQGE Rm, Ro, Rn

    CMPQEQ Rm, Imm5u/Imm6s, Rn
    CMPQNE Rm, Imm5u/Imm6s, Rn
    CMPQGT Rm, Imm5u/Imm6s, Rn
    CMPQLT Rm, Imm5u/Imm6s, Rn

    Though, SLT and CMPQGT are basically the same operation, just
    conceptually with the inputs flipped.

    The Imm5u/Imm6s (5u in Baseline, 6s in XG2) forms differ slightly in
    that one can't flip the arguments, but the difference between GT and GE
    is subtracting 1 from the immediate...
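
    That GT/GE fold-over can be stated as a one-liner (hedged sketch; valid whenever imm - 1 does not wrap):

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* x >= imm  <=>  x > imm - 1, so a GE-with-immediate can be
       encoded as a GT with the immediate decremented by one. */
    bool ge_via_gt(int64_t x, int64_t imm) {
        return x > imm - 1;
    }
    ```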


    I am also left deciding if the modified XG2 jumbo prefix rules (that
    hacked F2 block instructions to Imm64) should be applied to the F0 block
    to effectively extend Imm29s to Imm33s.


    One could just ignore the distinction between 32 and 64 bit compare in
    hardware, but had still burnt the encoding space on this. In a new ISA
    design, I would likely drop the existence of 32-bit compare and use
    exclusively 64-bit compare.


    In many cases, the distinction between 32-bit and 64-bit operations,
    or between 2R and 3R cases, had ended up less significant than
    originally thought (and now have ended up gradually deprecating and
    disabling some of the 32-bit 2R encodings mostly due to "lack of
    relevance").

    I deprecated all of them.


    The vast majority of the 2R ops are things like "Convert A into B" or
    similar.

    But:
    ADD Rm, Rn
    Is kinda moot:
    ADD Rn, Rm, Rn
    Does the same thing.

    Though:
    MOV Rm, Rn
    EXTS.L Rm, Rn
    EXTU.L Rm, Rn
    Could almost be turned into:
    ADD Rm, 0, Rn
    ADDS.L Rm, 0, Rn
    ADDU.L Rm, 0, Rn
    Nevermind any potential wonk involved to maintain a 1 cycle latency.


    Or, given the large numbers of these instructions in my compiler's
    output, MOV or EXTS.L dropping to a 2-cycle latency has a fairly obvious impact on performance.


    Though, admittedly, part of the reason for a lot of separate 2R cases
    existing was that I had initially had the impression that there may
    have been a performance cost difference between 2R and 3R
    instructions. This ended up not really the case, as the various units
    ended up typically using 3R internally anyways.


    So, say, one needs an ALU with, say:
       2 inputs, one output;
    you forgot carry, and inversion to perform subtraction.
       Ability to bit-invert the second input
         along with inverting carry-in, ...
       Ability to sign or zero extend the output.

    So, My 66000 integer adder has 3 carry inputs, and I discovered a way to perform these that takes no more gates of delay than the typical 1-carry
    in 64-bit integer adder. This gives me a = -b -c; for free.
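
    One way to see why extra carry inputs buy this: by two's complement, -b - c == ~b + ~c + 2, i.e. an ordinary add of the two inverted operands with two carry-ins asserted. A sketch of the identity (not the gate-level design):

    ```c
    #include <stdint.h>

    /* a = -b - c computed as a single add pass over inverted operands. */
    int64_t neg_sub(int64_t b, int64_t c) {
        return (int64_t)(~(uint64_t)b + ~(uint64_t)c + 2u);
    }
    ```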


    The ALU design in my case does not support inverting arbitrary inputs,
    only doing ADD/SUB, in various forms.

    The Lane 1 ALU also does CMP and a bunch of CONV stuff, whereas the Lane
    2/3 ALUs are more minimal.


    So, say, operations:
       ADD / SUB (Add, 64-bit)
       ADDSL / SUBSL (Add, 32-bit, sign extend)  // nope
       ADDUL / SUBUL (Add, 32-bit, zero extend)  // nope
       AND
       OR
       XOR
       CMPEQ                                     // 1 ICMP inst
       CMPNE
       CMPGT (CMPLT implicit)
       CMPGE (CMPLE implicit)
       CMPHI (unsigned GT)
       CMPHS (unsigned GE)
    ....

    Where, internally compare works by performing a subtract and then
    producing a result based on some status bits (Z,C,S,O). As I see it,
    ideally these bits should not be exposed at the ISA level though (much
    pain and hair results from the existence of architecturally visible
    ALU status-flag bits).
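
    The subtract-then-flags scheme just described can be sketched as follows, with the Z/S/O bits kept purely internal to the function (an illustrative model, not any particular ALU):

    ```c
    #include <stdint.h>
    #include <stdbool.h>

    /* Signed greater-than derived from one subtract plus internal flags. */
    bool cmp_gt_signed(int64_t a, int64_t b) {
        uint64_t d = (uint64_t)a - (uint64_t)b;
        bool z = (d == 0);                 /* Z: result is zero        */
        bool s = (int64_t)d < 0;           /* S: sign of result        */
        bool o = ((a < 0) != (b < 0)) &&   /* O: signed overflow of    */
                 ((a < 0) != s);           /*    the subtract          */
        return !z && (s == o);             /* GT = !Z & (S == O)       */
    }
    ```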

    I agree that these flags should not be exposed through ISA; and I did not.
    On the other hand multi-precision arithmetic demands at least carry {or
    some other means which is even more powerful--such as CARRY.....}


    Yeah...

    I just sorta carried over the same old ADDC/SUBC instructions from SH.
    Never got upgraded to 3R forms either.


    Some other features could still be debated though, along with how much
    simplification could be possible.

    If I did a new design, would probably still keep predication and jumbo
    prefixes.

    I kept predication but not the way most predication works.
    My work on Mc 88120 and K9 taught me the futility of things in the instruction stream that provide artificial boundaries. I have a suspicion that if you have an FPGA capable of allowing you to build an 8-wide
    machine, you would do the jumbo stuff differently, too.


    Probably.

    When I considered a WEX-6W design, a lot of stuff for how things work in
    2W or 3W configurations did not scale. I had considered a different way
    for how bundling would work and how prefixes would work, and effectively
    broke the symmetry between scalar execution and bundled execution.

    The idea for WEX-6W was quickly abandoned when it started to become
    obvious that a single 6-wide core would be more expensive than two
    3-wide cores (at least, absent more drastic measures like partitioning
    the register space and/or eliminating the use of register forwarding).

    In effect, it likely would have either ended up looking like a more conventional "true" VLIW; or so expensive that it blows out the LUT
    budget on the XC7A100T.




    Explicit bundling vs superscalar could be argued either way, as
    superscalar isn't as expensive as initially thought, but in a simpler
    form is comparably weak (the compiler has an advantage that it can
    invest more expensive analysis into this, reorder instructions, etc;
    but this only goes so far as the compiler understands the CPU's pipeline,

    Compilers are notoriously unable to outguess a good branch predictor.


    Errm, assuming the compiler is capable of things like general-case
    inlining and loop-unrolling.

    I was thinking of simpler things, like shuffling operators between
    independent (sub)expressions to limit the number of register-register dependencies.

    Like, in-order superscalar isn't going to do crap if nearly every
    instruction depends on every preceding instruction. Even pipelining
    can't help much with this.
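
    A minimal example of the kind of shuffling meant here: re-associating an expression so that sub-expressions become independent, reducing the dependency depth so a 2-wide in-order core has something to issue in parallel:

    ```c
    #include <stdint.h>

    /* Serial chain: 4 dependent adds, each waits on the previous. */
    int64_t sum_serial(int64_t a, int64_t b, int64_t c, int64_t d, int64_t e) {
        return (((a + b) + c) + d) + e;
    }

    /* Re-associated: (a+b) and (c+d) are independent and can issue
       in the same cycle; dependency depth drops from 4 to 3. */
    int64_t sum_ilp(int64_t a, int64_t b, int64_t c, int64_t d, int64_t e) {
        return ((a + b) + (c + d)) + e;
    }
    ```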


    The compiler can shuffle the instructions into an order to limit the
    number of register dependencies and better fit the pipeline. But, then,
    most of the "hard parts" are already done (so it doesn't take much more
    for the compiler to flag which instructions can run in parallel).


    Meanwhile, a naive superscalar may miss cases that could be run in
    parallel, if it is evaluating the rules "coarsely" (say, evaluating what
    is safe or not safe to run things in parallel based on general groupings
    of opcodes rather than the rules of specific opcodes; or, say,
    false-positive register alias if, say, part of the Imm field of a 3RI instruction is interpreted as a register ID, ...).


    Granted, seemingly even a naive approach is able to get around 20% ILP
    out of "GCC -O3" output for RV64G...

    But, the GCC output doesn't seem to be quite as weak as some people are claiming either.


    ties the code to a specific pipeline structure, and becomes
    effectively moot with OoO CPU designs).

    OoO exists, in a practical sense, to abstract the pipeline out of the compiler; or conversely, to allow multiple implementations to run the
    same compiled code optimally on each implementation.


    Granted, but OoO isn't cheap.


    So, a case could be made that a "general use" ISA be designed without
    the use of explicit bundling. In my case, using the bundle flags also
    requires the code to use an instruction to signal to the CPU what
    configuration of pipeline it expects to run on, with the CPU able to
    fall back to scalar (or superscalar) execution if it does not match.

    Sounds like a bridge too far for your 8-wide GBOoO machine.


    For sake of possible fancier OoO stuff, I upheld a basic requirement for
    the instruction stream:
    The semantics of the instructions as executed in bundled order needs to
    be equivalent to that of the instructions as executed in sequential order.

    In this case, the OoO CPU can entirely ignore the bundle hints, and
    treat "WEXMD" as effectively a NOP.


    This would have broken down for WEX-5W and WEX-6W (where enforcing a parallel==sequential constraint effectively becomes unworkable, and/or
    renders the wider pipeline effectively moot), but these designs are
    likely dead anyways.

    And, with 3-wide, the parallel==sequential order constraint remains in
    effect.
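The parallel==sequential requirement can be stated as an executable check: a bundle is legal only if executing it with every read seeing pre-bundle register state produces the same result as executing it one instruction at a time (which is why an OoO core can treat the bundle hint as a NOP). A toy model, with invented 3-operand instruction tuples:

```python
# Each instruction is (dst, src1, src2, op). A bundle satisfies the
# parallel==sequential constraint iff both execution models agree.

import operator

OPS = {"add": operator.add, "sub": operator.sub}

def run_seq(bundle, regs):
    r = dict(regs)
    for dst, a, b, op in bundle:
        r[dst] = OPS[op](r[a], r[b])      # later ops see earlier results
    return r

def run_par(bundle, regs):
    r = dict(regs)
    for dst, a, b, op in bundle:
        r[dst] = OPS[op](regs[a], regs[b])  # all reads see pre-bundle state
    return r

def bundle_ok(bundle, regs):
    return run_seq(bundle, regs) == run_par(bundle, regs)

regs = {1: 10, 2: 20, 3: 0, 4: 0}
good = [(3, 1, 2, "add"), (4, 1, 2, "sub")]  # independent ops
bad  = [(3, 1, 2, "add"), (4, 3, 2, "add")]  # 2nd reads 1st's result

print(bundle_ok(good, regs))  # True
print(bundle_ok(bad, regs))   # False
```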


    For the most part, thus far nearly everything has ended up as "Mode
    2", namely:
       3 lanes;
         Lane 1 does everything;
         Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
         Lane 3 only does Basic ALU ops and a few CONV ops and similar.
           Lane 3 originally also did Shift, dropped to reduce cost.
         Mem ops may eat Lane 3, ...
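Those Mode 2 lane rules can be sketched as a bundle-validity check. The op classes and the exact "Mem ops may eat Lane 3" behavior are simplified assumptions based only on the list above:

```python
# Simplified "Mode 2" lane capabilities: Lane 1 does everything,
# Lane 2 a subset, Lane 3 a smaller subset.

LANE_OK = [
    {"mem", "branch", "fpu", "alu", "shift", "conv"},  # Lane 1
    {"alu", "shift", "conv"},                          # Lane 2
    {"alu", "conv"},                                   # Lane 3
]

def mode2_bundle_valid(ops):
    if len(ops) > 3:
        return False
    for lane, op in enumerate(ops):
        if op not in LANE_OK[lane]:
            return False
    # Assumed reading of "Mem ops may eat Lane 3": a memory op in Lane 1
    # consumes the third slot, so no 3-wide bundle may lead with one.
    if len(ops) == 3 and ops[0] == "mem":
        return False
    return True

print(mode2_bundle_valid(["mem", "alu"]))           # True
print(mode2_bundle_valid(["mem", "alu", "alu"]))    # False (Mem eats Lane 3)
print(mode2_bundle_valid(["alu", "shift", "alu"]))  # True
print(mode2_bundle_valid(["alu", "alu", "shift"]))  # False (no Shift in Lane 3)
```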

    Try 6-lanes:
       1,2,3 Memory ops + integer ADD and Shifts
       4     FADD   ops + integer ADD and FMisc
       5     FMAC   ops + integer ADD
       6     CMP-BR ops + integer ADD


    As can be noted, my thing is more a "LIW" rather than a "true VLIW".

    So, MEM/BRA/CMP/... all end up in Lane 1.

    Lanes 2/3 effectively end up used to fold over most of the ALU ops,
    turning Lane 1 mostly into a wall of Load and Store instructions.


    Where, say:
       Mode 0 (Default):
         Only scalar code is allowed, CPU may use superscalar (if available).
       Mode 1:
         2 lanes:
           Lane 1 does everything;
           Lane 2 does ALU, Shift, and CONV.
         Mem ops take up both lanes.
           Effectively scalar for Load/Store.
           Later defined that 128-bit MOV.X is allowed in a Mode 1 core.
    Modeless.



    Had defined wider modes, and ones that allow dual-lane IO and FPU
    instructions, but these haven't seen use (too expensive to support in
    hardware).

    Had ended up with the ambiguous "extension" to the Mode 2 rules of
    allowing an FPU instruction to be executed from Lane 2 if there was
    not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
    instructions if they effectively combine into a corresponding SIMD op.

    In my current configurations, there is only a single memory access port.

    This should imply that your 3-wide pipeline is running at 90%-95% memory/cache saturation.


    If you mean that execution is mostly running end-to-end memory
    operations, yeah, this is basically true.


    Comparably, RV code seems to end up running a lot of non-memory ops in
    Lane 1, whereas BJX2 is mostly running lots of memory ops, with Lane 2 handling most of the ALU ops and similar (and Lane 3, occasionally).


    A second memory access port would help with performance, but is
    comparably a rather expensive feature (and doesn't help enough to
    justify its fairly steep cost).

    For lower-end cores, a case could be made for assuming a 1-wide CPU
    with a 2R1W register file, but designing the whole ISA around this
    limitation and not allowing for anything more is limiting (and mildly
    detrimental to performance). If we can assume cores with an FPU, we
    can probably also assume cores with more than two register read ports
    available.

    If you design around the notion of a 3R1W register file, FMAC and INSERT
    fall out of the encoding easily. Done right, one can switch it into a 4R
    or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.


    Possibly.

    It looks like some savings could be possible in terms of prologs and
    epilogs.

    As-is, these are generally like:
    MOV LR, R18
    MOV GBR, R19
    ADD -192, SP
    MOV.X R18, (SP, 176) //save GBR and LR
    MOV.X ... //save registers

    WEXMD 2 //specify that we want 3-wide execution here

    //Reload GBR, *1
    MOV.Q (GBR, 0), R18
    MOV 0, R0 //special reloc here
    MOV.Q (GBR, R0), R18
    MOV R18, GBR

    //Generate Stack Canary, *2
    MOV 0x5149, R18 //magic number (randomly generated)
    VSKG R18, R18 //Magic (combines input with SP and magic numbers)
    MOV.Q R18, (SP, 144)

    ...
    function-specific stuff
    ...

    MOV 0x5149, R18
    MOV.Q (SP, 144), R19
    VSKC R18, R19 //Validate canary
    ...


    *1: This part ties into the ABI, and mostly exists so that each PE image
    can get GBR reloaded back to its own ".data"/".bss" sections (with
    multiple program instances in a single address space). But, does mean
    that pretty much every non-leaf function ends up needing to go through
    this ritual.
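The GBR reload ritual amounts to a two-step indirection: slot 0 of the current data section points at a process-wide table of data-section bases, and each image indexes that table with its own (relocated) offset to find its instance's base. The memory layout below is hypothetical, mirroring only the MOV.Q (GBR, 0) / MOV.Q (R18, R0) sequence in the prolog above:

```python
# Model of the per-function GBR reload, treating memory as a dict of
# qword addresses -> qword values.

def reload_gbr(memory, gbr, image_offset):
    table = memory[gbr + 0]               # MOV.Q  (GBR, 0), R18
    return memory[table + image_offset]   # MOV.Q  (R18, R0), R18 ; MOV R18, GBR

# Two instances of the same PE image in one address space, each with its
# own data section and its own base table.
memory = {
    1000: 5000,   # instance A: slot 0 -> A's base table
    2000: 6000,   # instance B: slot 0 -> B's base table
    5008: 1000,   # A's table entry for this image -> A's data base
    6008: 2000,   # B's table entry for this image -> B's data base
}

print(reload_gbr(memory, 1000, 8))  # 1000 (instance A lands in A's data)
print(reload_gbr(memory, 2000, 8))  # 2000 (instance B lands in B's data)
```

Whichever instance's data section GBR pointed at on entry, the function ends up back in that same instance's globals, which is the property PC-relative addressing cannot provide here.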

    *2: Done for pretty much any function that has local arrays or similar;
    serves to protect the register save area. If the magic number can't
    regenerate a matching canary at the end of the function, then a fault
    is generated.
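The VSKG/VSKC protocol can be sketched as: mix a per-build magic number with the stack pointer and a per-process secret, store the result below the local arrays, and regenerate-and-compare on exit. The mixing function and secret below are invented; only the generate/store/validate protocol follows the text:

```python
# Sketch of stack-canary generate (VSKG) and check (VSKC). The hash is a
# placeholder mixer, not the actual hardware function.

PROCESS_SECRET = 0x9E3779B97F4A7C15  # assumed per-process random value

def vskg(magic, sp):
    x = (magic ^ sp ^ PROCESS_SECRET) & 0xFFFFFFFFFFFFFFFF
    return (x * 0x2545F4914F6CDD1D) & 0xFFFFFFFFFFFFFFFF

def vskc(magic, sp, stored):
    if vskg(magic, sp) != stored:
        raise RuntimeError("stack canary fault")

sp = 0x7FFF_FF00
canary = vskg(0x5149, sp)          # prolog: generate, store at (SP, 144)
vskc(0x5149, sp, canary)           # epilog: regenerate and compare -> OK

try:
    vskc(0x5149, sp, canary ^ 1)   # an overflow clobbered the slot
except RuntimeError as e:
    print(e)                       # stack canary fault
```

Because SP differs per call frame and the secret differs per process, a leaked canary from one frame does not let an attacker forge one for another.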

    The cost of some of this starts to add up.


    In isolation, not much, but if all this happens, say, 500 or 1000 times
    or more in a program, this can add up.



    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 21 18:57:27 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
    BGB wrote:

    Compilers are notoriously unable to outguess a good branch predictor.


    Errm, assuming the compiler is capable of things like general-case
    inlining and loop-unrolling.

    I was thinking of simpler things, like shuffling operators between independent (sub)expressions to limit the number of register-register dependencies.

    Like, in-order superscalar isn't going to do crap if nearly every instruction depends on every preceding instruction. Even pipelining
    can't help much with this.

    Pipelining CREATED this (back to back dependencies). No amount of
    pipelining can eradicate RAW data dependencies.

    The compiler can shuffle the instructions into an order to limit the
    number of register dependencies and better fit the pipeline. But, then,
    most of the "hard parts" are already done (so it doesn't take much more
    for the compiler to flag which instructions can run in parallel).

    Compiler scheduling works for exactly 1 pipeline implementation and
    is suboptimal for all others.

    Meanwhile, a naive superscalar may miss cases that could be run in
    parallel, if it is evaluating the rules "coarsely" (say, evaluating what
    is safe or not safe to run things in parallel based on general groupings
    of opcodes rather than the rules of specific opcodes; or, say, false-positive register alias if, say, part of the Imm field of a 3RI instruction is interpreted as a register ID, ...).


    Granted, seemingly even a naive approach is able to get around 20% ILP
    out of "GCC -O3" output for RV64G...

    But, the GCC output doesn't seem to be quite as weak as some people are claiming either.


    ties the code to a specific pipeline structure, and becomes
    effectively moot with OoO CPU designs).

    OoO exists, in a practical sense, to abstract the pipeline out of the
    compiler; or conversely, to allow multiple implementations to run the
    same compiled code optimally on each implementation.


    Granted, but OoO isn't cheap.

    But it does get the job done.

    So, a case could be made that a "general use" ISA be designed without
    the use of explicit bundling. In my case, using the bundle flags also
    requires the code to use an instruction to signal to the CPU what
    configuration of pipeline it expects to run on, with the CPU able to
    fall back to scalar (or superscalar) execution if it does not match.

    Sounds like a bridge too far for your 8-wide GBOoO machine.


    For sake of possible fancier OoO stuff, I upheld a basic requirement for
    the instruction stream:
    The semantics of the instructions as executed in bundled order needs to
    be equivalent to that of the instructions as executed in sequential order.

    In this case, the OoO CPU can entirely ignore the bundle hints, and
    treat "WEXMD" as effectively a NOP.


    This would have broken down for WEX-5W and WEX-6W (where enforcing a parallel==sequential constraint effectively becomes unworkable, and/or renders the wider pipeline effectively moot), but these designs are
    likely dead anyways.

    And, with 3-wide, the parallel==sequential order constraint remains in effect.


    For the most part, thus far nearly everything has ended up as "Mode
    2", namely:
       3 lanes;
         Lane 1 does everything;
         Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
         Lane 3 only does Basic ALU ops and a few CONV ops and similar.
           Lane 3 originally also did Shift, dropped to reduce cost.
         Mem ops may eat Lane 3, ...

    Try 6-lanes:
       1,2,3 Memory ops + integer ADD and Shifts
       4     FADD   ops + integer ADD and FMisc
       5     FMAC   ops + integer ADD
       6     CMP-BR ops + integer ADD


    As can be noted, my thing is more a "LIW" rather than a "true VLIW".

    Mine is neither LIW nor VLIW, but it definitely is LBIO through GBOoO.

    So, MEM/BRA/CMP/... all end up in Lane 1.

    Lanes 2/3 effectively end up used to fold over most of the ALU ops,
    turning Lane 1 mostly into a wall of Load and Store instructions.


    Where, say:
       Mode 0 (Default):
         Only scalar code is allowed, CPU may use superscalar (if available).
       Mode 1:
         2 lanes:
           Lane 1 does everything;
           Lane 2 does ALU, Shift, and CONV.
         Mem ops take up both lanes.
           Effectively scalar for Load/Store.
           Later defined that 128-bit MOV.X is allowed in a Mode 1 core.

    Modeless.



    Had defined wider modes, and ones that allow dual-lane IO and FPU
    instructions, but these haven't seen use (too expensive to support in
    hardware).

    Had ended up with the ambiguous "extension" to the Mode 2 rules of
    allowing an FPU instruction to be executed from Lane 2 if there was
    not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
    instructions if they effectively combine into a corresponding SIMD op.

    In my current configurations, there is only a single memory access port.
    This should imply that your 3-wide pipeline is running at 90%-95%
    memory/cache saturation.


    If you mean that execution is mostly running end-to-end memory
    operations, yeah, this is basically true.


    Comparably, RV code seems to end up running a lot of non-memory ops in
    Lane 1, whereas BJX2 is mostly running lots of memory ops, with Lane 2 handling most of the ALU ops and similar (and Lane 3, occasionally).

    One of the things that I notice with My 66000 is when you get all the constants you ever need at the calculation OpCodes, you end up with
    FEWER instructions that "go random places" such as instructions that
    <well> paste constants together. This leaves you with a data-dependent
    string of calculations with occasional memory references. That is::
    universal constants get rid of the easy-to-pipeline extra instructions
    leaving the meat of the algorithm exposed.


    If you design around the notion of a 3R1W register file, FMAC and INSERT
    fall out of the encoding easily. Done right, one can switch it into a 4R
    or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.

    Possibly.

    It looks like some savings could be possible in terms of prologs and epilogs.

    As-is, these are generally like:
    MOV LR, R18
    MOV GBR, R19
    ADD -192, SP
    MOV.X R18, (SP, 176) //save GBR and LR
    MOV.X ... //save registers

    Why not an instruction that saves LR and GBR without wasting instructions
    to place them side by side prior to saving them ??

    WEXMD 2 //specify that we want 3-wide execution here

    //Reload GBR, *1
    MOV.Q (GBR, 0), R18
    MOV 0, R0 //special reloc here
    MOV.Q (GBR, R0), R18
    MOV R18, GBR

    It is gorp like that that lead me to do it in HW with ENTER and EXIT.
    Save registers to the stack, setup FP if desired, allocate stack on SP,
    and decide if EXIT also does RET or just reloads the file. This would
    require 2 free registers if done in pure SW, along with several MOVs...

    //Generate Stack Canary, *2
    MOV 0x5149, R18 //magic number (randomly generated)
    VSKG R18, R18 //Magic (combines input with SP and magic numbers)
    MOV.Q R18, (SP, 144)

    ...
    function-specific stuff
    ...

    MOV 0x5149, R18
    MOV.Q (SP, 144), R19
    VSKC R18, R19 //Validate canary
    ...


    *1: This part ties into the ABI, and mostly exists so that each PE image
    can get GBR reloaded back to its own ".data"/".bss" sections (with

    Universal displacements make GBR unnecessary as a memory reference can
    be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
    can read GOT[#i] directly without a pointer to it.

    multiple program instances in a single address space). But, does mean
    that pretty much every non-leaf function ends up needing to go through
    this ritual.

    Universal constant solves the underlying issue.

    *2: Pretty much any function that has local arrays or similar, serves to protect register save area. If the magic number can't regenerate a
    matching canary at the end of the function, then a fault is generated.

    My 66000 can place the callee save registers in a place where user cannot access them with LDs or modify them with STs. So malicious code cannot
    damage the contract between ABI and core.

    The cost of some of this starts to add up.


    In isolation, not much, but if all this happens, say, 500 or 1000 times
    or more in a program, this can add up.

    Was thinking about that last night. H&P "book" statistics say that call/ret represents 2% of instructions executed. But if you add up the prologue and epilogue instructions you find 8% of instructions are related to calling
    and returning--taking the problem from (at 2%) ignorable to (at 8%) a big ticket item demanding something be done.

    8% represents saving/restoring only 3 registers via stack and associated SP arithmetic. So, it can easily go higher.

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 21 17:56:21 2024
    From Newsgroup: comp.arch

    On 4/21/2024 1:57 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
    BGB wrote:

    Compilers are notoriously unable to outguess a good branch predictor.


    Errm, assuming the compiler is capable of things like general-case
    inlining and loop-unrolling.

    I was thinking of simpler things, like shuffling operators between
    independent (sub)expressions to limit the number of register-register
    dependencies.

    Like, in-order superscalar isn't going to do crap if nearly every
    instruction depends on every preceding instruction. Even pipelining
    can't help much with this.

    Pipelining CREATED this (back to back dependencies). No amount of
    pipelining can eradicate RAW data dependencies.


    Pretty much, this is the problem.

    But, when one converts from expressions to instructions either via
    directly walking the AST, or by going to RPN and then generating
    instructions from the RPN. Then the generated code has this problem
    pretty bad.

    Seemingly the only real fix is to try to shuffle things around, at the
    3AC or machine-instruction level, or both, to try to reduce the number
    of RAW dependencies.


    Though, this is an areas where "things could have been done better" in
    BGBCC. Though, mostly it would be in the backend.

    Ironically, the approach of first compiling everything into an RPN
    bytecode, then generating 3AC and machine code from the RPN, seems to
    work reasonably OK. Even if the bytecode itself is kinda weird.

    Though, one area that could be improved is the memory overhead of BGBCC,
    where generally BGBCC uses too much RAM to really be viable to have
    TestKern be self-hosting.


    The compiler can shuffle the instructions into an order to limit the
    number of register dependencies and better fit the pipeline. But,
    then, most of the "hard parts" are already done (so it doesn't take
    much more for the compiler to flag which instructions can run in
    parallel).

    Compiler scheduling works for exactly 1 pipeline implementation and
    is suboptimal for all others.


    Possibly true.

    But, can note, even crude shuffling is better than no shuffling this
    case. And, the shuffling needed to make an in-order superscalar not
    perform like crap, also happens to map over well to a LIW (and is the
    main hard part of the problem).
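The payoff from even crude shuffling can be shown with a toy in-order dual-issue model: emitting two independent chains back to back serializes on RAW hazards, while interleaving them gives the pair picker work every cycle. Latencies and issue rules below are invented for illustration:

```python
# Toy in-order, dual-issue, 1-cycle-latency model. Each op is (dst, srcs).
# An op stalls until all its sources were written in an earlier cycle.

def cycles(prog):
    ready = {}        # reg -> cycle its value becomes available
    cyc, slot = 0, 0  # current issue cycle, slots used this cycle (max 2)
    for dst, srcs in prog:
        need = max((ready.get(s, 0) for s in srcs), default=0)
        if need > cyc or slot == 2:
            cyc, slot = max(cyc + 1, need), 0
        ready[dst] = cyc + 1
        slot += 1
    return cyc + 1

# Two independent 3-op dependency chains: a->b->c and x->y->z.
chained     = [("a", []), ("b", ["a"]), ("c", ["b"]),
               ("x", []), ("y", ["x"]), ("z", ["y"])]
interleaved = [("a", []), ("x", []), ("b", ["a"]),
               ("y", ["x"]), ("c", ["b"]), ("z", ["y"])]

print(cycles(chained))      # 5 (RAW chain mostly serializes issue)
print(cycles(interleaved))  # 3 (independent chains pair up every cycle)
```

The same interleaving that rescues the in-order superscalar is exactly what a LIW bundle wants, which is the "maps over well" point above.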


    Meanwhile, a naive superscalar may miss cases that could be run in
    parallel, if it is evaluating the rules "coarsely" (say, evaluating
    what is safe or not safe to run things in parallel based on general
    groupings of opcodes rather than the rules of specific opcodes; or,
    say, false-positive register alias if, say, part of the Imm field of a
    3RI instruction is interpreted as a register ID, ...).


    Granted, seemingly even a naive approach is able to get around 20% ILP
    out of "GCC -O3" output for RV64G...

    But, the GCC output doesn't seem to be quite as weak as some people
    are claiming either.


    ties the code to a specific pipeline structure, and becomes
    effectively moot with OoO CPU designs).

    OoO exists, in a practical sense, to abstract the pipeline out of the
    compiler; or conversely, to allow multiple implementations to run the
    same compiled code optimally on each implementation.


    Granted, but OoO isn't cheap.

    But it does get the job done.


    But... Also makes the CPU too big and expensive to fit into most consumer/hobbyist grade FPGAs.

    They can do in-order designs pretty OK though.


    People were doing some impressive looking things over on the Altera side
    of things, but it is harder to do a direct comparison between Cyclone V
    and Artix / Spartan.


    Some stuff I was skimming though implied that I guess the free version
    of Quartus is more limited vs Vivado, and one effectively needs to pay
    for the commercial version to make full use of the FPGA (whereas Vivado
    allows mostly full use of the FPGA, but not any FPGA's larger than a
    certain cutoff).

    Well, and the non-free version of Vivado costs well more than I could
    justify spending on a hobby project.


    So, a case could be made that a "general use" ISA be designed
    without the use of explicit bundling. In my case, using the bundle
    flags also requires the code to use an instruction to signal to the
    CPU what configuration of pipeline it expects to run on, with the
    CPU able to fall back to scalar (or superscalar) execution if it
    does not match.

    Sounds like a bridge too far for your 8-wide GBOoO machine.


    For sake of possible fancier OoO stuff, I upheld a basic requirement
    for the instruction stream:
    The semantics of the instructions as executed in bundled order needs
    to be equivalent to that of the instructions as executed in sequential
    order.

    In this case, the OoO CPU can entirely ignore the bundle hints, and
    treat "WEXMD" as effectively a NOP.


    This would have broken down for WEX-5W and WEX-6W (where enforcing a
    parallel==sequential constraint effectively becomes unworkable, and/or
    renders the wider pipeline effectively moot), but these designs are
    likely dead anyways.

    And, with 3-wide, the parallel==sequential order constraint remains in
    effect.


    For the most part, thus far nearly everything has ended up as "Mode
    2", namely:
       3 lanes;
         Lane 1 does everything;
         Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
         Lane 3 only does Basic ALU ops and a few CONV ops and similar.
           Lane 3 originally also did Shift, dropped to reduce cost.
         Mem ops may eat Lane 3, ...

    Try 6-lanes:
        1,2,3 Memory ops + integer ADD and Shifts
        4     FADD   ops + integer ADD and FMisc
        5     FMAC   ops + integer ADD
        6     CMP-BR ops + integer ADD


    As can be noted, my thing is more a "LIW" rather than a "true VLIW".

    Mine is neither LIW nor VLIW, but it definitely is LBIO through GBOoO.


    I aimed for Scalar and LIW.

    On the XC7S25 and XC7A35T, can't really do much more than a simple
    scalar core (it is a pain enough even trying to fit an FPU into the thing).

    On the XC7S50 (~ 33k LUT), it is more a challenge of trying to fit both
    a 3-wide core and an FP-SIMD unit (fitting the CPU onto it is a little
    easier if one skips the existence of FP-SIMD, or can accept slower SIMD implemented by pipelining the elements through the FPU).

    I had been looking into a configuration for the XC7S50 which had dropped
    down to a more limited 2-wide configuration (with a 4R2W register file),
    but keeping the SIMD unit intact. Mostly trying to optimize this case
    for doing lots of SIMD math for NN workloads.

    This is vaguely similar to a past considered "GPU Profile", but
    ultimately ended up implementing the rasterizer module instead (which is cheaper and a little faster at this task than a CPU core would have
    been, albeit less flexible).



    Doing in-order superscalar for BJX2 could be possible, but haven't put
    much effort into this thus far, as the "WEX-3W" profile currently hits
    this nail pretty well.

    Did end up going with superscalar for RISC-V, mostly as no other option.

    It is, however, a fairly narrow window...


    For smaller targets, need to fall back to scalar, and for wider, part of
    the ISA design becomes effectively moot.


    So, MEM/BRA/CMP/... all end up in Lane 1.

    Lanes 2/3 effectively end up used to fold over most of the ALU ops,
    turning Lane 1 mostly into a wall of Load and Store instructions.


    Where, say:
       Mode 0 (Default):
         Only scalar code is allowed, CPU may use superscalar (if
    available).
       Mode 1:
         2 lanes:
           Lane 1 does everything;
           Lane 2 does ALU, Shift, and CONV.
         Mem ops take up both lanes.
           Effectively scalar for Load/Store.
           Later defined that 128-bit MOV.X is allowed in a Mode 1 core.

    Modeless.



    Had defined wider modes, and ones that allow dual-lane IO and FPU
    instructions, but these haven't seen use (too expensive to support
    in hardware).

    Had ended up with the ambiguous "extension" to the Mode 2 rules of
    allowing an FPU instruction to be executed from Lane 2 if there was
    not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
    instructions if they effectively combine into a corresponding SIMD op.

    In my current configurations, there is only a single memory access
    port.

    This should imply that your 3-wide pipeline is running at 90%-95%
    memory/cache saturation.


    If you mean that execution is mostly running end-to-end memory
    operations, yeah, this is basically true.


    Comparably, RV code seems to end up running a lot of non-memory ops in
    Lane 1, whereas BJX2 is mostly running lots of memory ops, with Lane 2
    handling most of the ALU ops and similar (and Lane 3, occasionally).

    One of the things that I notice with My 66000 is when you get all the constants you ever need at the calculation OpCodes, you end up with
    FEWER instructions that "go random places" such as instructions that
    <well> paste constants together. This leaves you with a data-dependent
    string of calculations with occasional memory references. That is::
    universal constants get rid of the easy-to-pipeline extra instructions
    leaving the meat of the algorithm exposed.


    Possibly true.

    RISC-V tends to have a lot of extra instructions due to lack of big
    constants and lack of indexed addressing.


    And, BJX2 has a lot of frivolous register-register MOV instructions.

    Also often bulkier prologs/epilogs (despite folding off the register save/restore past a certain number of registers).

    Seemingly, GCC is better at being more effective with fewer registers,
    if compared with BGBCC (which kinda chews through registers).

    Have managed to get to a point of being roughly break-even in terms of
    ".text" size (and a little smaller overall, due to not also having some
    big mess of constants off in ".rodata" or similar).


    Some bulk in my case is due to GBR reloading (needed for the ABI), and
    stack canary checks. Can shave some size of the binaries by disabling
    them, but then code is more vulnerable to potential buffer overflows.

    One can also enable bounds checking, but this has an overhead for both code-size and performance (it is comparably more heavyweight than the stack-canary checks).

    Though, GCC does none of these by default...



    If you design around the notion of a 3R1W register file, FMAC and INSERT
    fall out of the encoding easily. Done right, one can switch it into a 4R
    or 4W register file for ENTER and EXIT--lessening the overhead of
    call/ret.


    Possibly.

    It looks like some savings could be possible in terms of prologs and
    epilogs.

    As-is, these are generally like:
       MOV    LR, R18
       MOV    GBR, R19
       ADD    -192, SP
       MOV.X  R18, (SP, 176)  //save GBR and LR
       MOV.X  ...  //save registers

    Why not an instruction that saves LR and GBR without wasting instructions
    to place them side by side prior to saving them ??


    I have an optional MOV.C instruction, but would need to restructure the
    code for generating the prologs to make use of them in this case.

    Say:
    MOV.C GBR, (SP, 184)
    MOV.C LR, (SP, 176)

    Though, MOV.C is considered optional.

    There is a "MOV.C Lite" option, which saves some cost by only allowing
    it for certain CR's (mostly LR and GBR), which also sort of overlaps
    with (and is needed) by RISC-V mode, because these registers are in GPR
    land for RV.

    But, in any case, current compiler output shuffles them to R18 and R19
    before saving them.


       WEXMD  2  //specify that we want 3-wide execution here

       //Reload GBR, *1
       MOV.Q  (GBR, 0), R18
       MOV    0, R0  //special reloc here
       MOV.Q  (GBR, R0), R18
       MOV    R18, GBR


    Correction:
    MOV.Q (R18, R0), R18


    It is gorp like that that lead me to do it in HW with ENTER and EXIT.
    Save registers to the stack, setup FP if desired, allocate stack on SP,
    and decide if EXIT also does RET or just reloads the file. This would require 2 free registers if done in pure SW, along with several MOVs...


    Possibly.
    Part of the reason it loads into R0 and uses R0 as an index was that I
    defined this mechanism before jumbo prefixes existed, and hadn't updated
    it to allow for jumbo prefixes.


    Well, and if I used a direct displacement for GBR (which, along with PC,
    is always BYTE Scale), this would have created a hard limit of 64 DLL's
    per process-space (I defined it as Disp24, which allows a more
    reasonable hard upper limit of 2M DLLs per process-space).

    Granted, nowhere near even the limit of 64 as of yet. But, I had noted
    that Windows programs would often easily exceed this limit, with even a
    fairly simple program pulling in a fairly large number of random DLLs,
    so in any case, a larger limit was needed.
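The DLL-count limits follow directly from the displacement widths: with byte-scaled displacements indexing a table of 8-byte (QWORD) pointers, reach is 2^bits / 8 entries. The 9-bit width for the small form is an assumption here (the text only states the resulting limit of 64); Disp24 is stated:

```python
# Reach of a byte-scaled displacement into a table of 8-byte pointers.

def table_entries(disp_bits, entry_size=8):
    return (1 << disp_bits) // entry_size

print(table_entries(9))   # 64 -> the "hard limit of 64 DLLs" (assumed Disp9)
print(table_entries(24))  # 2097152 -> the "2M DLLs per process-space"
```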


    One potential optimization here is that the main EXE will always be 0 in
    the process, so this sequence could be reduced to, potentially:
    MOV.Q (GBR, 0), R18
    MOV.C (R18, 0), GBR

    Early on, I did not have the constraint that main EXE was always 0, and
    had initially assumed it would be treated equivalently to a DLL.


       //Generate Stack Canary, *2
       MOV    0x5149, R18  //magic number (randomly generated)
       VSKG   R18, R18  //Magic (combines input with SP and magic numbers)
       MOV.Q  R18, (SP, 144)

       ...
       function-specific stuff
       ...

       MOV    0x5149, R18
       MOV.Q  (SP, 144), R19
       VSKC   R18, R19  //Validate canary
       ...


    *1: This part ties into the ABI, and mostly exists so that each PE
    image can get GBR reloaded back to its own ".data"/".bss" sections (with

    Universal displacements make GBR unnecessary as a memory reference can
    be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
    can read GOT[#i] directly without a pointer to it.


    If I were doing a more conventional ABI, I would likely use (PC,
    Disp33s) for accessing global variables.

    Problem is:
    What if one wants multiple logical instances of a given PE image in a
    single address space?

    PC REL breaks in this case, unless you load N copies of each PE image,
    which is a waste of memory (well, or use COW mappings, mandating the use
    of an MMU).


    ELF FDPIC had used a different strategy, but then effectively turned
    each function call into something like (in SH):
    MOV R14, R2 //R14=GOT
    MOV disp, R0 //offset into GOT
    ADD R0, R2 //adjust by offset
    //R2=function pointer
    MOV.L (R2, 0), R1 //function address
    MOV.L (R2, 4), R3 //GOT
    JSR R1

    In the callee:
    ... save registers ...
    MOV R3, R14 //put GOT into a callee-save register
    ...

    In the BJX2 ABI, had rolled this part into the callee, reasoning that
    handling it in the callee (per-function) was less overhead than handling
    it in the caller (per function call).
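The FDPIC sequence quoted above amounts to calling through a two-word function descriptor (code address, GOT pointer), with the callee adopting the passed GOT as its globals base. A minimal model of that convention (descriptor representation and names are invented; the structure follows the SH sequence):

```python
# FDPIC-style call: a callable is a descriptor (entry, got). The caller
# passes the descriptor's GOT along, so one code image can serve multiple
# data instances in one address space.

def fdpic_call(desc, *args):
    entry, got = desc            # MOV.L (R2,0),R1 ; MOV.L (R2,4),R3
    return entry(got, *args)     # JSR R1, with GOT arriving in R3

def counter_fn(got, n):          # callee: reaches globals through GOT
    got["count"] += n
    return got["count"]

inst_a = (counter_fn, {"count": 0})    # same code...
inst_b = (counter_fn, {"count": 100})  # ...two per-instance data sections

print(fdpic_call(inst_a, 5))  # 5
print(fdpic_call(inst_b, 5))  # 105
```

The BJX2 ABI's choice moves the "adopt the GOT" step into the callee's prolog, paying it once per function body rather than at every call site.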


    Though, on the RISC-V side, it has the relative advantage of compiling
    for absolute addressing, albeit still loses in terms of performance.

    I don't imagine an FDPIC version of RISC-V would win here, but this is
    only assuming there exists some way to get GCC to output FDPIC binaries
    (most I could find, was people debating whether to add FDPIC support for RISC-V).

    PIC or PIE would also sort of work, but these still don't really allow
    for multiple program instances in a single address space.


    multiple program instances in a single address space). But, does mean
    that pretty much every non-leaf function ends up needing to go through
    this ritual.

    Universal constant solves the underlying issue.


    I am not so sure that they could solve the "map multiple instances of
    the same binary into a single address space" issue, which is sort of the
    whole thing for why GBR is being used.

    Otherwise, I would have been using PC-REL...




    *2: Pretty much any function that has local arrays or similar, serves
    to protect register save area. If the magic number can't regenerate a
    matching canary at the end of the function, then a fault is generated.

    My 66000 can place the callee save registers in a place where user cannot access them with LDs or modify them with STs. So malicious code cannot
    damage the contract between ABI and core.


    Possibly. I am using a conventional linear stack.

    Downside: There is a need either for bounds checking or canaries.
    Canaries are the cheaper option in this case.


    The cost of some of this starts to add up.


    In isolation, not much, but if all this happens, say, 500 or 1000
    times or more in a program, this can add up.

    Was thinking about that last night. H&P "book" statistics say that call/ret represents 2% of instructions executed. But if you add up the prologue and epilogue instructions you find 8% of instructions are related to calling
    and returning--taking the problem from (at 2%) ignorable to (at 8%) a big ticket item demanding something be done.

    8% represents saving/restoring only 3 registers via stack and associated SP arithmetic. So, it can easily go higher.


    I guess it could make sense to add a compiler stat for this...

    The save/restore can get folded off, but generally only done for
    functions with a larger number of registers being saved/restored (and
    does not cover secondary things like GBR reload or stack canary stuff,
    which appears to possibly be a significant chunk of space).


    Goes and adds a stat for averages:
    Prolog: 8% (avg= 24 bytes)
    Epilog: 4% (avg= 12 bytes)
    Body : 88% (avg=260 bytes)

    With 959 functions counted (excluding empty functions/prototypes).


    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 21 23:31:55 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/21/2024 1:57 PM, MitchAlsup1 wrote:
    BGB wrote:

    One of the things that I notice with My 66000 is when you get all the
    constants you ever need at the calculation OpCodes, you end up with
    FEWER instructions that "go random places" such as instructions that
    <well> paste constants together. This leaves you with a data-dependent
    string of calculations with occasional memory references. That is::
    universal constants gets rid of the easy to pipeline extra instructions
    leaving the meat of the algorithm exposed.


    Possibly true.

    RISC-V tends to have a lot of extra instructions due to lack of big constants and lack of indexed addressing.

    You forgot the "everyone and his brother" design of the ISA.

    And, BJX2 has a lot of frivolous register-register MOV instructions.

    I empower you to get rid of them....
    <snip>
    If you design around the notion of a 3R1W register file, FMAC and INSERT
    fall out of the encoding easily. Done right, one can switch it into a 4R
    or 4W register file for ENTER and EXIT--lessening the overhead of
    call/ret.


    Possibly.

    It looks like some savings could be possible in terms of prologs and
    epilogs.

    As-is, these are generally like:
       MOV    LR, R18
       MOV    GBR, R19
       ADD    -192, SP
       MOV.X  R18, (SP, 176)  //save GBR and LR
       MOV.X  ...  //save registers

    Why not an instruction that saves LR and GBR without wasting instructions
    to place them side by side prior to saving them ??


    I have an optional MOV.C instruction, but would need to restructure the
    code for generating the prologs to make use of them in this case.

    Say:
    MOV.C GBR, (SP, 184)
    MOV.C LR, (SP, 176)

    Though, MOV.C is considered optional.

    There is a "MOV.C Lite" option, which saves some cost by only allowing
    it for certain CR's (mostly LR and GBR), which also sort of overlaps
    with (and is needed) by RISC-V mode, because these registers are in GPR
    land for RV.

    But, in any case, current compiler output shuffles them to R18 and R19 before saving them.


       WEXMD  2  //specify that we want 3-wide execution here

       //Reload GBR, *1
       MOV.Q  (GBR, 0), R18
       MOV    0, R0  //special reloc here
       MOV.Q  (GBR, R0), R18
       MOV    R18, GBR


    Correction:
    MOV.Q (R18, R0), R18


    It is gorp like that that led me to do it in HW with ENTER and EXIT.
    Save registers to the stack, setup FP if desired, allocate stack on SP,
    and decide if EXIT also does RET or just reloads the file. This would
    require 2 free registers if done in pure SW, along with several MOVs...


    Possibly.
    The partial reason it loads into R0 and uses R0 as an index, was that I defined this mechanism before jumbo prefixes existed, and hadn't updated
    it to allow for jumbo prefixes.

    No time like the present...

    Well, and if I used a direct displacement for GBR (which, along with PC,
    is always BYTE Scale), this would have created a hard limit of 64 DLL's
    per process-space (I defined it as Disp24, which allows a more
    reasonable hard upper limit of 2M DLLs per process-space).

    In my case, restricting myself to 32-bit IP relative addressing, GOT can
    be anywhere within ±2GB of the accessing instruction and can be as big as
    one desires.

    Granted, nowhere near even the limit of 64 as of yet. But, I had noted
    that Windows programs would often easily exceed this limit, with even a fairly simple program pulling in a fairly large number of random DLLs,
    so in any case, a larger limit was needed.

    Due to the way linkages work in My 66000, each DLL gets its own GOT.
    So there is essentially no bounds on how many can be present/in-use.
    A LD of a GOT[entry] gets a pointer to the external variable.
    A CALX of GOT[entry] is a call through the GOT table using std ABI.
    {{There is no PLT}}

    One potential optimization here is that the main EXE will always be 0 in
    the process, so this sequence could be reduced to, potentially:
    MOV.Q (GBR, 0), R18
    MOV.C (R18, 0), GBR

    Early on, I did not have the constraint that main EXE was always 0, and
    had initially assumed it would be treated equivalently to a DLL.


       //Generate Stack Canary, *2
       MOV    0x5149, R18  //magic number (randomly generated)
       VSKG   R18, R18  //Magic (combines input with SP and magic numbers)
       MOV.Q  R18, (SP, 144)

       ...
       function-specific stuff
       ...

       MOV    0x5149, R18
       MOV.Q  (SP, 144), R19
       VSKC   R18, R19  //Validate canary
       ...


    *1: This part ties into the ABI, and mostly exists so that each PE
    image can get GBR reloaded back to its own ".data"/".bss" sections (with
    Universal displacements make GBR unnecessary as a memory reference can
    be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
    can read GOT[#i] directly without a pointer to it.


    If I were doing a more conventional ABI, I would likely use (PC,
    Disp33s) for accessing global variables.

    Even those 128GB away ??

    Problem is:
    What if one wants multiple logical instances of a given PE image in a
    single address space?

    Not a problem when each PE has a different set of mapping tables (at least
    the entries pointing at the GOTs).

    PC REL breaks in this case, unless you load N copies of each PE image,
    which is a waste of memory (well, or use COW mappings, mandating the use
    of an MMU).


    ELF FDPIC had used a different strategy, but then effectively turned
    each function call into something like (in SH):
       MOV    R14, R2   //R14=GOT
       MOV    disp, R0  //offset into GOT
       ADD    R0, R2    //adjust by offset
       //R2=function pointer
       MOV.L  (R2, 0), R1  //function address
       MOV.L  (R2, 4), R3  //GOT
       JSR    R1

    Which I do with::

    CALX [IP,R0,#GOT+index<<3-.]

    In the callee:
    ... save registers ...
    MOV R3, R14 //put GOT into a callee-save register
    ...

    In the BJX2 ABI, had rolled this part into the callee, reasoning that handling it in the callee (per-function) was less overhead than handling
    it in the caller (per function call).
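The FDPIC sequence above amounts to calling through a function descriptor: a "function pointer" is really a pair of (code address, GOT address), and the callee reaches its globals through the GOT handed to it. A minimal sketch, with Python callables standing in for code addresses (names are illustrative):

```python
# Sketch of the FDPIC calling convention described above: a descriptor
# holds both the code address and the GOT of the image it belongs to.
class Descriptor:
    def __init__(self, code, got):
        self.code = code          # what MOV.L (R2,0), R1 would load
        self.got = got            # what MOV.L (R2,4), R3 would load

def fdpic_call(desc, *args):
    # Caller: load code + GOT from the descriptor, pass GOT along, jump.
    return desc.code(desc.got, *args)     # JSR R1, with R3 = GOT

def bump(got, n):                 # callee finds its globals via the GOT
    got["counter"] += n
    return got["counter"]

# Two logical instances share code (bump) but not data (their GOTs):
got_a, got_b = {"counter": 0}, {"counter": 100}
print(fdpic_call(Descriptor(bump, got_a), 1))   # 1
print(fdpic_call(Descriptor(bump, got_b), 1))   # 101
```

This is what makes multiple instances of one image possible in a single address space: the code is shared, and only the descriptor's GOT half differs per instance.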


    Though, on the RISC-V side, it has the relative advantage of compiling
    for absolute addressing, albeit still loses in terms of performance.

    Compiling and linking to absolute addresses works "really well" when one
    needs to place different sections in different memory every time the application/kernel runs due to malicious codes trying to steal everything. ASLR.....

    I don't imagine an FDPIC version of RISC-V would win here, but this is
    only assuming there exists some way to get GCC to output FDPIC binaries (most I could find, was people debating whether to add FDPIC support for RISC-V).

    PIC or PIE would also sort of work, but these still don't really allow
    for multiple program instances in a single address space.

    Once you share the code and some of the data, the overhead of using different mappings for special stuff {GOT, local thread data,...} is

    multiple program instances in a single address space). But, does mean
    that pretty much every non-leaf function ends up needing to go through
    this ritual.

    Universal constant solves the underlying issue.


    I am not so sure that they could solve the "map multiple instances of
    the same binary into a single address space" issue, which is sort of the whole thing for why GBR is being used.

    Otherwise, I would have been using PC-REL...




    *2: Pretty much any function that has local arrays or similar, serves
    to protect register save area. If the magic number can't regenerate a
    matching canary at the end of the function, then a fault is generated.

    My 66000 can place the callee-save registers in a place where the user
    cannot access them with LDs or modify them with STs. So malicious code
    cannot damage the contract between ABI and core.


    Possibly. I am using a conventional linear stack.

    Downside: There is a need either for bounds checking or canaries.
    Canaries are the cheaper option in this case.


    The cost of some of this starts to add up.


    In isolation, not much, but if all this happens, say, 500 or 1000
    times or more in a program, this can add up.

    Was thinking about that last night. H&P "book" statistics say that call/ret
    represents 2% of instructions executed. But if you add up the prologue and
    epilogue instructions you find 8% of instructions are related to calling
    and returning--taking the problem from (at 2%) ignorable to (at 8%) a big
    ticket item demanding something be done.

    8% represents saving/restoring only 3 registers via stack and associated SP
    arithmetic. So, it can easily go higher.


    I guess it could make sense to add a compiler stat for this...

    The save/restore can get folded off, but generally only done for
    functions with a larger number of registers being saved/restored (and
    does not cover secondary things like GBR reload or stack canary stuff,
    which appears to possibly be a significant chunk of space).


    Goes and adds a stat for averages:
    Prolog: 8% (avg= 24 bytes)
    Epilog: 4% (avg= 12 bytes)
    Body : 88% (avg=260 bytes)

    With 959 functions counted (excluding empty functions/prototypes).
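The percentages and byte averages quoted above are mutually consistent, assuming the percentages are shares of the average total bytes per function:

```python
# Consistency check of the compiler stats quoted above.
avg_bytes = {"Prolog": 24, "Epilog": 12, "Body": 260}
total = sum(avg_bytes.values())                        # 296 bytes/function
shares = {k: round(100 * v / total) for k, v in avg_bytes.items()}
print(shares)   # matches the 8% / 4% / 88% split in the text
```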


    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sun Apr 21 19:16:04 2024
    From Newsgroup: comp.arch

    On Sun, 21 Apr 2024 18:57:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    BGB wrote:

    Like, in-order superscalar isn't going to do crap if nearly every
    instruction depends on every preceding instruction. Even pipelining
    can't help much with this.

    Pipelining CREATED this (back to back dependencies). No amount of
    pipelining can eradicate RAW data dependencies.

    This is quite true. However, in case an unsophisticated individual
    might read this thread, I think that I shall clarify.

    Without pipelining, it is not a problem if each instruction depends on
    the one immediately previous, and so people got used to writing
    programs that way, as it was simple to write the code to do one thing
    before starting to write the code to begin doing another thing.

    This remained true when the simplest original form of pipelining was
    brought in - where fetching one instruction from memory was overlapped
    with decoding the previous instruction, and executing the instruction
    before that.

    It's only when what was originally called "superpipelining" came
    along, where the execute stages of multiple successive instructions
    could be overlapped, that it was necessary to do something about
    dependencies in order to take advantage of the speedup that could
    provide.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 21 22:59:12 2024
    From Newsgroup: comp.arch

    On 4/21/2024 8:16 PM, John Savard wrote:
    On Sun, 21 Apr 2024 18:57:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    BGB wrote:

    Like, in-order superscalar isn't going to do crap if nearly every
    instruction depends on every preceding instruction. Even pipelining
    can't help much with this.

    Pipelining CREATED this (back to back dependencies). No amount of
    pipelining can eradicate RAW data dependencies.

    This is quite true. However, in case an unsophisticated individual
    might read this thread, I think that I shall clarify.

    Without pipelining, it is not a problem if each instruction depends on
    the one immediately previous, and so people got used to writing
    programs that way, as it was simple to write the code to do one thing
    before starting to write the code to begin doing another thing.


    Yeah.

    This is also typical of naive compiler output, say:
    y=m*x+b;
    Turns into RPN as, say:
    LD(m) LD(x) MUL LD(b) ADD ST(C)
    Which, in a naive compiler (though, one with register allocation) may
    become, say:
    MULS R8, R9, R12
    ADD R12, R10, R13
    MOV R13, R11 //MUL output first goes into temporary

    But, if MUL is 3c and ADD is 2c, this ends up needing 6 cycles.

    The situation would be significantly worse in a compiler lacking
    register allocation (would add 8 memory operations to this; similar to
    what one gets with "gcc -O0").

    For the most part, as can be noted, I was comparing against "gcc -O3" on
    the RV64 side, with "-ffunction-sections" and "-Wl,--gc-sections" and
    similar, as otherwise GCC's output is significantly larger. Though,
    nevermind the seemingly fairly bulky ELF metadata (PE/COFF is seemingly
    a bit more compact here). Can note that "-O3" vs "-Os" also doesn't seem
    to make that big of a difference for RV64.



    If one has another expression, one can shuffle the operations between
    the expressions together, and the latency is lower than if no shuffling had occurred; and if one can reduce dependencies enough, operations can be
    run in parallel for further gain. But, all this depends on first being
    able to shuffle things to break up the register-register dependencies
    between instructions.
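The effect of this shuffling can be shown with a toy in-order issue model: one instruction issues per cycle, but an instruction stalls until its operands are ready. The latencies (MUL=3c, ADD=2c, MOV=1c) follow the text; the model itself is an illustrative assumption, not BGBCC's WEXifier:

```python
# Toy in-order issue model: 1 instruction issues per cycle, stalling
# until all source operands' results are available.
LAT = {"MUL": 3, "ADD": 2, "MOV": 1}

def cycles(prog):
    ready, next_issue, done = {}, 0, 0
    for op, dst, srcs in prog:
        t = max([next_issue] + [ready.get(s, 0) for s in srcs])  # stall for operands
        ready[dst] = t + LAT[op]
        done = max(done, ready[dst])
        next_issue = t + 1
    return done

# y = m*x + b as a dependent chain: MUL -> ADD -> MOV.
def chain(tag):
    return [("MUL", "t" + tag, ["m" + tag, "x" + tag]),
            ("ADD", "y" + tag, ["t" + tag, "b" + tag]),
            ("MOV", "r" + tag, ["y" + tag])]

two_seq = chain("1") + chain("2")       # two expressions, back to back
interleaved = [i for pair in zip(chain("1"), chain("2")) for i in pair]
print(cycles(chain("1")), cycles(two_seq), cycles(interleaved))  # 6 12 7
```

One chain alone costs the 6 cycles noted earlier; two chains back to back cost 12, but interleaving them hides most of the latency and finishes in 7.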


    In BGBCC, this part was done via the WEXifier, which imposes a lot of
    annoying restrictions (partly because it starts working after code
    generation has already taken place).

    In size-optimized code, this doesn't happen, which results in a
    performance hit. This is partly since the WEXifier can only work with
    32-bit instructions, can't cross labels or relocs, and requires the
    register allocator to essentially round-robin the registers to minimize dependencies, ...

    But, preferentially always allocating a new register and avoiding
    reusing registers within a basic block, while it reduces dependencies,
    also eats a lot more registers (with the indirect cost of increasing the number that need to be saved/restored, though the size-impact of this is reduced somewhat via prolog/epilog compression).


    Though, one can shuffle stuff at the 3AC level (which exists in my case between the RPN and final code generation), but this is more hit-or-miss.

    Better would have been to go from 3AC to a "virtual assembler", which
    could then allow reordering before emitting the actual machine code (and
    thus wouldn't be as restricted). This was originally considered, but
    ended up not going this way as it seemed like more work (in terms of
    internal restructuring) than to shove the logic in after the
    machine-code was generated.

    But, the current compiler architecture was the result of always doing
    the most quick/dirty option at the time, which doesn't necessarily
    result in an optimal design.

    Granted, OTOH, "waterfall method" doesn't really have the best
    track-record either (vs the "hack something together, hack on it some
    more, ..." method).


    This remained true when the simplest original form of pipelining was
    brought in - where fetching one instruction from memory was overlapped
    with decoding the previous instruction, and executing the instruction
    before that.

    It's only when what was originally called "superpipelining" came
    along, where the execute stages of multiple successive instructions
    could be overlapped, that it was necessary to do something about
    dependencies in order to take advantage of the speedup that could
    provide.


    Yeah.

    Pipeline:
      PF:
        PC arrives at I$, selected from:
          If branch: the branch PC;
          Else, if branch-predicted: the branch-predictor result;
          Else: LastPC + PCStep.
      IF:
        Fetch 96 bits at PC;
        Figure how much to advance PC;
        Figure out if we can do superscalar here:
          Check for register clashes;
          Check for valid prefix and suffix;
          If both checks pass, go for it.
      ID:
        Unpack instruction words;
        Pipeline now splits into 3 lanes;
        Branch predictor does its thing.
      ID2/RF:
        Results come in from the registers;
        Figure out if the current bundle can enter the EX stages;
        Figure out if each predicated instruction should execute.
      EX1 (EX1C|EX1B|EX1A):
        Do stuff: ALU, initiate memory access, ...
      EX2 (EX2C|EX2B|EX2A):
        Do stuff | results arrive.
      EX3 (EX3C|EX3B|EX3A):
        Results arrive;
        Produce any final results.
      WB:
        Results are written into the register file.

    By EX1, it is known whether or not the branch will actually be taken, so
    (if needed) it may override the former guess of the branch-predictor. By
    EX2, the branch-initiation takes effect, and by EX3, the new PC reaches
    the I$ (overriding whether else would have normally arrived).


    In a few cases (such as a jump between ISA modes), extra cycles may be
    needed to make sure everything is caught up (so, the same PC address is
    held on the I$ input for around 3 cycles in this case).

    This may happen if, say:
    Jumping between Baseline, XG2, or RISC-V;
    WEXMD changing whether WEX decoding is enabled or disabled;
    If disabled, it behaves as if the WEX'ed instructions were scalar;
    Jumbo prefixes ignore this (always behaving as-if it were enabled);
    ...
    Mostly to make sure that IF and ID can decode the instructions as is
    correct for the mode in question.


    John Savard

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 22 07:49:30 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    BGB wrote:

    On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
    Like, in-order superscalar isn't going to do crap if nearly every
    instruction depends on every preceding instruction. Even pipelining
    can't help much with this.

    Pipelining CREATED this (back to back dependencies). No amount of
    pipelining can eradicate RAW data dependencies.

    The compiler can shuffle the instructions into an order to limit the
    number of register dependencies and better fit the pipeline. But,
    then, most of the "hard parts" are already done (so it doesn't take
    much more for the compiler to flag which instructions can run in
    parallel).

    Compiler scheduling works for exactly 1 pipeline implementation and
    is suboptimal for all others.

    Well, yeah.

    OTOH, if your (definitely not my!) compiler can schedule a 4-wide static ordering of operations, then it will be very nearly optimal on 2-wide
    and 3-wide as well. (The difference is typically in a bit more loop
    setup and cleanup code than needed.)

    Hand-optimizing Pentium asm code did teach me to "think like a cpu",
    which is probably the only part of the experience which is still kind of relevant. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 22 02:44:09 2024
    From Newsgroup: comp.arch

    On 4/22/2024 12:49 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
    Like, in-order superscalar isn't going to do crap if nearly every
    instruction depends on every preceding instruction. Even pipelining
    can't help much with this.

    Pipelining CREATED this (back to back dependencies). No amount of
    pipelining can eradicate RAW data dependencies.

    The compiler can shuffle the instructions into an order to limit the
    number of register dependencies and better fit the pipeline. But,
    then, most of the "hard parts" are already done (so it doesn't take
    much more for the compiler to flag which instructions can run in
    parallel).

    Compiler scheduling works for exactly 1 pipeline implementation and
    is suboptimal for all others.

    Well, yeah.

    OTOH, if your (definitely not my!) compiler can schedule a 4-wide static ordering of operations, then it will be very nearly optimal on 2-wide
    and 3-wide as well. (The difference is typically in a bit more loop
    setup and cleanup code than needed.)

    Hand-optimizing Pentium asm code did teach me to "think like a cpu",
    which is probably the only part of the experience which is still kind of relevant. :-)


    Mine is hard-pressed to even make effective use of the current pipeline,
    so going wider does not make sense at present.


    As I had noted before, the main merit of 3 wide in my case is that it
    makes it easier to justify a 6R register file, which, unlike the 4R
    register file, doesn't choke up with trying to run other instructions in parallel with memory store and similar (which is actually a fairly
    serious restriction given how much memory operations tend to clog up
    Lane 1; opportunities for "ALU|ST" being more common than "ALU|ALU").


    Granted, one could argue that (Reg, Disp) memory addressing could be
    supported entirely within a 2R1W pattern, which while true in premise,
    does not match my implementation (which always uses indexed addressing internally, treating the Disp as a virtual register; thus eating 3
    register ports).

    Well, and for the 4R2W configuration, the main priority is minimizing
    LUT cost (which favors leaving it as-is, with the current restrictions).


    Granted, some similar issues apply to 128-bit MOV.X and SIMD ops, which
    as-is can only exist as scalar ops. These could potentially also be
    hacked around (say, to allow ALU|SIMD or ALU|MOV.X, but the "fix" would
    cost a lot of LUTs). Mostly in that variability in terms of input
    routing does not come cheap.


    Though, that said, the 3rd lane still gets used for a share of basic ALU instructions, so isn't entirely going to waste either.


    Terje


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 22 04:42:07 2024
    From Newsgroup: comp.arch

    On 4/21/2024 6:31 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/21/2024 1:57 PM, MitchAlsup1 wrote:
    BGB wrote:

    One of the things that I notice with My 66000 is when you get all the
    constants you ever need at the calculation OpCodes, you end up with
    FEWER instructions that "go random places" such as instructions that
    <well> paste constants together. This leaves you with a data-dependent
    string of calculations with occasional memory references. That is::
    universal constants gets rid of the easy to pipeline extra instructions
    leaving the meat of the algorithm exposed.


    Possibly true.

    RISC-V tends to have a lot of extra instructions due to lack of big
    constants and lack of indexed addressing.

    You forgot the "everyone and his brother" design of the ISA.

    And, BJX2 has a lot of frivolous register-register MOV instructions.

    I empower you to get rid of them....


    OK, I more meant that the compiler is prone to emit lots of:
    MOV Reg, Reg

    Rather than, say, the ISA listing being full of redundant "MOV Reg, Reg" encodings...

    But, getting rid of them has been an ongoing battle of compiler fiddling.

    Many were popping up from odd corners, say:
    Register allocation issues (mostly involving function call/return);
    Type casts and promotions between equivalent representations (*1);
    ...

    *1: Had eliminated some of these, by allowing temporaries to be coerced directly into different types in some cases. But, "for reasons" doesn't
    really work with some other types of variables. But, say, int->long, or casting between pointer types, etc, can be done without needing to do
    anything to the value in the register.

    But, yes, performance and code density would be better with fewer
    frivolous register MOVs.



    <snip>
    If you design around the notion of a 3R1W register file, FMAC and INSERT
    fall out of the encoding easily. Done right, one can switch it into a 4R
    or 4W register file for ENTER and EXIT--lessening the overhead of
    call/ret.


    Possibly.

    It looks like some savings could be possible in terms of prologs and
    epilogs.

    As-is, these are generally like:
       MOV    LR, R18
       MOV    GBR, R19
       ADD    -192, SP
       MOV.X  R18, (SP, 176)  //save GBR and LR
       MOV.X  ...  //save registers

    Why not an instruction that saves LR and GBR without wasting
    instructions
    to place them side by side prior to saving them ??


    I have an optional MOV.C instruction, but would need to restructure
    the code for generating the prologs to make use of them in this case.

    Say:
       MOV.C  GBR, (SP, 184)
       MOV.C  LR, (SP, 176)

    Though, MOV.C is considered optional.

    There is a "MOV.C Lite" option, which saves some cost by only allowing
    it for certain CR's (mostly LR and GBR), which also sort of overlaps
    with (and is needed) by RISC-V mode, because these registers are in
    GPR land for RV.

    But, in any case, current compiler output shuffles them to R18 and R19
    before saving them.


       WEXMD  2  //specify that we want 3-wide execution here

       //Reload GBR, *1
       MOV.Q  (GBR, 0), R18
       MOV    0, R0  //special reloc here
       MOV.Q  (GBR, R0), R18
       MOV    R18, GBR


    Correction:
    ;    MOV.Q  (R18, R0), R18


    It is gorp like that that led me to do it in HW with ENTER and EXIT.
    Save registers to the stack, setup FP if desired, allocate stack on
    SP, and decide if EXIT also does RET or just reloads the file. This
    would require 2 free registers if done in pure SW, along with several
    MOVs...


    Possibly.
    The partial reason it loads into R0 and uses R0 as an index, was that
    I defined this mechanism before jumbo prefixes existed, and hadn't
    updated it to allow for jumbo prefixes.

    No time like the present...


    OK. Made this change.

    Only a minor change to my compiler and PE loader.


    Well, and if I used a direct displacement for GBR (which, along with
    PC, is always BYTE Scale), this would have created a hard limit of 64
    DLL's per process-space (I defined it as Disp24, which allows a more
    reasonable hard upper limit of 2M DLLs per process-space).
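The 2M figure follows directly from the encoding, assuming the GBR table holds one 8-byte pointer per DLL and Disp24 is a byte-scaled displacement into it:

```python
# Arithmetic behind the 2M-DLL limit: a Disp24 byte displacement spans
# 2^24 bytes, divided into 8-byte GBR-pointer table entries.
POINTER_BYTES = 8
disp24_entries = (1 << 24) // POINTER_BYTES
print(disp24_entries)            # 2097152, i.e. 2M DLLs per process-space
```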

    In my case, restricting myself to 32-bit IP relative addressing, GOT can
    be anywhere within ±2GB of the accessing instruction and can be as big
    as one desires.


    In this case:
    GBR points to the start of ".data" for a given PE image;
    This starts with a pointer to a table of GBR pointers for every DLL in
    the process;
    Each DLL is assigned an index into this table, fixed up at load time;
    The magic ritual, when performed, will get GBR pointing at the
    ".data"/".bss" sections for that particular DLL.


    But, say, one loads the EXE and DLLs.


    One creates a program instance by allocating memory for each of the
    data/bss sections, copying the data section from the base image, and
    putting it in the table. Then jumping to the entry point with the EXE's section in GBR.

    One can fire up a new instance by allocating a new set of data areas,
    jumping to the entry point as before. This instance does not need to
    know or care that the prior instance exists, even if both exist in the
    same address space, and have all their code at the same addresses (since
    the ".text" sections are shared between all instances).


    Normal PC-relative GOT's can't do this. You would either need multiple
    address spaces, or multiple loaded copies of each image.


    Granted, nowhere near even the limit of 64 as of yet. But, I had noted
    that Windows programs would often easily exceed this limit, with even
    a fairly simple program pulling in a fairly large number of random
    DLLs, so in any case, a larger limit was needed.

    Due to the way linkages work in My 66000, each DLL gets its own GOT.
    So there is essentially no bounds on how many can be present/in-use.
    A LD of a GOT[entry] gets a pointer to the external variable.
    A CALX of GOT[entry] is a call through the GOT table using std ABI.
    {{There is no PLT}}


    OK.


    Had done it a little different:
    Imported function gets a stub, generally like:
    Foo:
      MOV.Q  (PC, 4), R1
      JMP    R1
    _imp_Foo: .QWORD 0

    Import table (or IAT / Import Address Table) points at _imp_Foo and
    fixes it up to point at the imported function.
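The stub-plus-fixup mechanism can be sketched as follows, with Python callables standing in for code addresses and a dict for the `_imp_` slots (a simulation of the idea, not the actual loader):

```python
# Sketch of the import-stub scheme above: each imported function gets a
# stub that jumps through a slot (_imp_Foo), which the loader patches
# from the import table (IAT) at load time.
imp_slots = {"Foo": None}             # _imp_Foo: .QWORD 0 (not yet fixed up)

def stub_Foo(*args):                  # Foo: MOV.Q (PC,4), R1 ; JMP R1
    return imp_slots["Foo"](*args)

def loader_fixup(exports):            # walk the IAT, resolve by name
    for name in imp_slots:
        imp_slots[name] = exports[name]

loader_fixup({"Foo": lambda x: x + 1})    # the DLL's export table provides Foo
print(stub_Foo(41))                       # 42
```

Because the call always goes through the slot, the stub works regardless of which ISA mode the target DLL was built for, which is the problem the `BRA Abs48` variant ran into.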

    Had defined an alternate version:
    Foo:
    _imp_Foo:
    BRA Abs48

    The loader would see and special-case the BRA Abs48 instruction.


    But, this latter form ran into a problem:
    Things will violently explode if the EXE and DLL (or one DLL and
    another) are not in the same ISA mode (say, Baseline vs XG2).

    Which means, I am back to the less efficient option of needing to load
    then branch.

    Granted:
    MOV Imm64, R1
    JMP R1
    Could also work, and saves a few clock cycles.

    Doesn't currently extend to global variables, but I don't really feel
    this is a huge loss. Might fix eventually.

    Generally, loader is hard-coded to assume import by name, as I didn't
    feel it worth the bother to try to deal with importing by 16-bit ordinal number.



    One potential optimization here is that the main EXE will always be 0
    in the process, so this sequence could be reduced to, potentially:
       MOV.Q (GBR, 0), R18
       MOV.C (R18, 0), GBR

    Early on, I did not have the constraint that main EXE was always 0,
    and had initially assumed it would be treated equivalently to a DLL.


       //Generate Stack Canary, *2
       MOV    0x5149, R18  //magic number (randomly generated)
       VSKG   R18, R18  //Magic (combines input with SP and magic numbers)
       MOV.Q  R18, (SP, 144)

       ...
       function-specific stuff
       ...

       MOV    0x5149, R18
       MOV.Q  (SP, 144), R19
       VSKC   R18, R19  //Validate canary
       ...
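The VSKG/VSKC idea above can be sketched in Python. The mixing function is an assumption: the text only says the magic number is combined with SP and internal magic numbers, so a keyed mix is one plausible reading, not the real hardware algorithm:

```python
# Sketch of the stack-canary scheme: derive a canary from the function's
# magic number plus SP and a secret, store it in the frame, regenerate
# and compare in the epilogue, faulting on mismatch.
MASK = (1 << 64) - 1
SECRET = 0x9E3779B97F4A7C15            # per-boot secret key (assumed)

def vskg(magic, sp):                   # VSKG: generate canary
    return (magic * SECRET ^ sp) & MASK

def prologue(magic, sp, frame):
    frame[144] = vskg(magic, sp)       # MOV.Q R18, (SP, 144)

def epilogue(magic, sp, frame):        # VSKC: fault if mismatch
    if frame[144] != vskg(magic, sp):
        raise RuntimeError("stack canary fault")

frame, sp = {}, 0x7FFF0000
prologue(0x5149, sp, frame)
epilogue(0x5149, sp, frame)            # clean return: no fault
frame[144] ^= 1                        # simulate an overflow clobbering the slot
try:
    epilogue(0x5149, sp, frame)
except RuntimeError as e:
    print(e)                           # stack canary fault
```

Since the canary guards the register save area below the local arrays, any linear overwrite that reaches the saved registers trips the check first.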


    *1: This part ties into the ABI, and mostly exists so that each PE
    image can get GBR reloaded back to its own ".data"/".bss" sections
    (with

    Universal displacements make GBR unnecessary as a memory reference can
    be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes,
    you can read GOT[#i] directly without a pointer to it.


    If I were doing a more conventional ABI, I would likely use (PC,
    Disp33s) for accessing global variables.

    Even those 128GB away ??


    Nothing is going to have a ".data" or ".bss" section this big...

    Though, realistically, the PE/COFF format has a hard-limit of around 4GB
    due to 32-bit RVAs.


    Note that the section headers and data-directories are still the same
    basic layout as in PE32+.

    Well, nevermind the LZ4 compression and removal of MZ-EXE stub (I had no
    real need for an MZ EXE stub). Though, once the LZ4 decompression is
    done, the format is basically just PE32+ with the MZ stub removed (and a different format for the contents of the ".rsrc" section, *).

    *: The original ".rsrc" section was absurd, so I essentially replaced it
    with a modified version of the WAD2 format operating in RVA space.


    Problem is:
    What if one wants multiple logical instances of a given PE image in a
    single address space?

    Not a problem when each PE has a different set of mapping tables (at least
    the entries pointing at the GOTs).


    Could be.

    Wasn't using GOTs in this case.


    Had instead made an unorthodox reinterpretation of the meaning of the
    "Global Pointer" entry in the Data Directory.

    In the official version of PE/COFF, it is unused and must-be-zero IIRC.
    In my version, it mostly spans the start of ".data" to the end of
    ".bss", and gives the region of memory that GBR is intended to point at.


    PC REL breaks in this case, unless you load N copies of each PE image,
    which is a waste of memory (well, or use COW mappings, mandating the
    use of an MMU).


    ELF FDPIC had used a different strategy, but then effectively turned
    each function call into something like (in SH):
       MOV R14, R2   //R14=GOT
       MOV disp, R0  //offset into GOT
       ADD R0, R2    //adjust by offset
       //R2=function pointer
       MOV.L  (R2, 0), R1  //function address
       MOV.L  (R2, 4), R3  //GOT
       JSR    R1
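The descriptor-based call shown in that SH sequence can be modeled in a few lines. This is an illustrative sketch (the names and the way the GOT value is consumed are invented), not the actual FDPIC ABI:

```python
# Minimal model of an ELF FDPIC-style call, as in the SH sequence
# above: a "function pointer" is really a two-word descriptor
# (code address, GOT base), and the caller loads both before JSR.
from typing import Callable, Tuple

FuncDesc = Tuple[Callable[[int], int], int]

def fdpic_call(desc: FuncDesc, arg: int) -> int:
    func, got = desc          # MOV.L (R2,0),R1 / MOV.L (R2,4),R3
    # The callee gets its own GOT base, so its globals resolve even
    # when several instances of the image share one address space.
    return func(got + arg)    # JSR R1, GOT passed in the agreed register

def callee(x: int) -> int:    # stand-in for the called function
    return x * 2

desc: FuncDesc = (callee, 0x1000)  # descriptor as built by the loader
```

The point of the indirection is that the same code page can serve several instances, each with its own descriptor and GOT.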

    Which I do with::

        CALX   [IP,R0,#GOT+index<<3-.]


    OK.


    In the callee:
       ... save registers ...
       MOV R3, R14  //put GOT into a callee-save register
       ...

    In the BJX2 ABI, had rolled this part into the callee, reasoning that
    handling it in the callee (per-function) was less overhead than
    handling it in the caller (per function call).


    Though, on the RISC-V side, it has the relative advantage of compiling
    for absolute addressing, albeit still loses in terms of performance.

Compiling and linking to absolute addresses works "really well" when
one needs to place different sections in different memory every time
the application/kernel runs due to malicious code trying to steal
everything. ASLR.....


    Errm.

    The RV64G compiler output tends to be fixed address by default (at least
    with "riscv64-unknown-elf-gcc"). Can't be easily relocated.

    Though, this is theoretically the least overhead scenario for GCC.



    I don't imagine an FDPIC version of RISC-V would win here, but this is
    only assuming there exists some way to get GCC to output FDPIC
    binaries (most I could find, was people debating whether to add FDPIC
    support for RISC-V).

    PIC or PIE would also sort of work, but these still don't really allow
    for multiple program instances in a single address space.

Once you share the code and some of the data, the overhead of using
different mappings for special stuff {GOT, local thread data, ...} is
minimal.

But, this does mean that pretty much every non-leaf function ends up
needing to go through this ritual.

    Universal constant solves the underlying issue.


    I am not so sure that they could solve the "map multiple instances of
    the same binary into a single address space" issue, which is sort of
    the whole thing for why GBR is being used.

    Otherwise, I would have been using PC-REL...




*2: Used in pretty much any function that has local arrays or
similar; it serves to protect the register save area. If the magic
number can't regenerate a matching canary at the end of the function,
then a fault is generated.

My 66000 can place the callee save registers in a place where user
code cannot access them with LDs or modify them with STs. So malicious
code cannot damage the contract between ABI and core.


    Possibly. I am using a conventional linear stack.

    Downside: There is a need either for bounds checking or canaries.
    Canaries are the cheaper option in this case.
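As a rough model of the generate/validate pattern from the earlier listing: the mix function below is an arbitrary stand-in, not the real VSKG/VSKC, which combine the input with SP and internal magic numbers in hardware.

```python
# Rough model of the stack-canary scheme: a per-binary magic number is
# mixed with the stack pointer on entry, stored in the frame, and
# re-derived and compared on exit. The mix function is a stand-in.

MASK64 = (1 << 64) - 1

def gen_canary(magic: int, sp: int) -> int:
    # Mix the per-binary magic number with the stack pointer.
    return (magic * 0x9E3779B97F4A7C15 ^ sp) & MASK64

def check_canary(magic: int, sp: int, stored: int) -> None:
    # On mismatch the hardware would fault; here we raise instead.
    if gen_canary(magic, sp) != stored:
        raise RuntimeError("stack canary mismatch: frame corrupted")

sp = 0x7FFF_FF00
magic = 0x5149                    # the magic constant from the example
saved = gen_canary(magic, sp)     # prologue: MOV / VSKG / MOV.Q
check_canary(magic, sp, saved)    # epilogue: VSKC passes
```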


    The cost of some of this starts to add up.


    In isolation, not much, but if all this happens, say, 500 or 1000
    times or more in a program, this can add up.

Was thinking about that last night. H&P "book" statistics say that
call/ret represents 2% of instructions executed. But if you add up the
prologue and epilogue instructions you find 8% of instructions are
related to calling and returning--taking the problem from (at 2%)
ignorable to (at 8%) a big ticket item demanding something be done.

8% represents saving/restoring only 3 registers via the stack and the
associated SP arithmetic. So, it can easily go higher.
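The 2%-to-8% jump can be sanity-checked with simple arithmetic. This sketch assumes one call and one return per pair plus three saves and three restores; the SP adjustments on top of that are what push the figure higher still:

```python
# Sanity-checking the 2% -> 8% figure: if calls+returns are 2% of
# dynamic instructions, call/ret *pairs* are 1% of the count. Each
# pair then adds 3 saves and 3 restores, so call-related work comes
# to 8 instructions per pair.

callret_frac = 0.02              # H&P: calls + returns, dynamic count
pair_frac = callret_frac / 2     # one call and one return per pair
saved_regs = 3
per_pair = 2 + 2 * saved_regs    # call + ret + saves + restores
related = pair_frac * per_pair   # fraction of all instructions
```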


    I guess it could make sense to add a compiler stat for this...

The save/restore can get folded off, but this is generally only done
for functions that save/restore a larger number of registers (and it
does not cover secondary things like the GBR reload or the stack
canary logic, which appears to be a possibly significant chunk of
space).


    Goes and adds a stat for averages:
       Prolog:  8%  (avg= 24 bytes)
       Epilog:  4%  (avg= 12 bytes)
       Body  : 88%  (avg=260 bytes)

    With 959 functions counted (excluding empty functions/prototypes).


    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 22 14:13:41 2024
    From Newsgroup: comp.arch

    On Sat, 20 Apr 2024 17:07:11 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!

    Oh, dear. This discussion has inspired me to rework the basic design
    of Concertina II _yet again_!

    The new design, not yet online, will have the following features:

    The code stream will continue to be divided into 256-bit blocks.

However, block headers will be eliminated. Instead, this functionality
    will be subsumed into the instruction set.

    Case I:

    Indicating that from 1 to 7 32-bit instruction slots in a block are
    not used for instructions, but instead may contain pseudo-immediates
    will be achieved by:

    Placing a two-address register-to-register operate instruction in the
    first instruction slot in a block. These instructions will have a
    three-bit field which, if nonzero, indicates the amount of space
    reserved.

    To avoid waste, when such an instruction is present in any slot other
    than the first, that field will have the following function:

    If nonzero, it points to an instruction slot (slots 1 through 7, in
    the second through eighth positions) and a duplicate copy of the
    instruction in that slot will be placed in the instruction stream
    immediately following the instruction with that field.

    The following special conditions apply:

    If the instruction slot contains a pair of 16-bit instructions, only
    the first of those instructions is so inserted for execution.

    The instruction slot may not be one that is reserved for
    pseudo-immediates, except that it may be the _first_ such slot, in
    which case, the first 16 bits of that slot are taken as a 16-bit
    instruction, with the format indicated by the first bit (as opposed to
    the usual 17th bit) of that instruction slot's contents.

    So it's possible to reserve an odd multiple of 16 bits for
    pseudo-immediates, so as to avoid waste.
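As a rough model of Case I (the exact bit position of the field is not modeled here), reserving n slots simply splits the block:

```python
# Illustrative model of Case I: a 256-bit block is eight 32-bit
# slots, and a 3-bit field in the slot-0 instruction reserves that
# many trailing slots for pseudo-immediates instead of instructions.

def split_block(slots, reserve_field):
    assert len(slots) == 8 and 0 <= reserve_field <= 7
    n = reserve_field                      # 0 means nothing is reserved
    return slots[:8 - n], slots[8 - n:]    # (instructions, immediates)

slots = list(range(8))                 # stand-in slot contents
insns, imms = split_block(slots, 3)    # reserve the last three slots
```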

    Case II:

    Instructions longer than 32 bits are specified by being of the form:

    The first instruction slot:

    11111
    00
    (3 bits) length in instruction slots, from 2 to 7
    (22 bits) rest of the first part of the instruction

    All remaining instruction slots:

    11111
    (3 bits) position within instruction, from 2 to 7
    (24 bits) rest of this part of the instruction

    This mechanism, however, will _also_ be used for VLIW functionality or
    prefix functionality which was formerly in block headers.

    In that case, the first instruction slot, and the remaining
    instruction slots, no longer need to be contiguous; instead, ordinary
32-bit instructions or pairs of 16-bit instructions can occur between
    the portions of the ensemble of prefixed instructions formed by this
    means.
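A sketch of reassembling one long instruction from its (possibly non-contiguous) parts, using the length and position fields described above; the bit widths follow the description, but the tuple encoding is invented:

```python
# Gathering a Case II long instruction: the first part carries a
# length (2..7) and 22 payload bits; continuation parts carry a
# position (2..7) and 24 payload bits, and need not be contiguous.

def gather(parts):
    # parts: iterable of (is_first, length_or_position, payload_bits)
    first = next(p for p in parts if p[0])
    length = first[1]
    payload = {1: first[2]}               # part 1: the 22-bit head
    for is_first, pos, bits in parts:
        if not is_first:
            payload[pos] = bits           # parts 2..length, any order
    assert sorted(payload) == list(range(1, length + 1))
    value = payload[1]                    # head, then 24 bits per part
    for i in range(2, length + 1):
        value = (value << 24) | payload[i]
    return value

parts = [(False, 2, 0xABCDEF), (True, 2, 0x15)]  # out of order on purpose
```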

    And there is a third improvement.

    When Case I above is in effect, the block in which space for
    pseudo-immediates is reserved will be stored in an internal register
    in the processor.

    Subsequent blocks can contain operate instructions with
    pseudo-immediate operands even if no space for pseudo-immediates is
    reserved in those blocks. In that case, the retained copy of the last
    block encountered in which pseudo-immediates were reserved shall be
    referenced instead.

    I think these changes will improve code density... or, at least, they
    will make it appear that no space is obviously forced to be wasted,
    even if no real improvement in code density results.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 22 16:22:11 2024
    From Newsgroup: comp.arch

    On Mon, 22 Apr 2024 14:13:41 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Sat, 20 Apr 2024 17:07:11 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!

    Oh, dear. This discussion has inspired me to rework the basic design
    of Concertina II _yet again_!

    The new design, not yet online, will have the following features:

    The code stream will continue to be divided into 256-bit blocks.

However, block headers will be eliminated. Instead, this functionality
    will be subsumed into the instruction set.

    Case I:

    Indicating that from 1 to 7 32-bit instruction slots in a block are
    not used for instructions, but instead may contain pseudo-immediates
    will be achieved by:

    Placing a two-address register-to-register operate instruction in the
    first instruction slot in a block. These instructions will have a
    three-bit field which, if nonzero, indicates the amount of space
    reserved.

    To avoid waste, when such an instruction is present in any slot other
    than the first, that field will have the following function:

    If nonzero, it points to an instruction slot (slots 1 through 7, in
    the second through eighth positions) and a duplicate copy of the
    instruction in that slot will be placed in the instruction stream
    immediately following the instruction with that field.

    The following special conditions apply:

    If the instruction slot contains a pair of 16-bit instructions, only
    the first of those instructions is so inserted for execution.

    The instruction slot may not be one that is reserved for
    pseudo-immediates, except that it may be the _first_ such slot, in
    which case, the first 16 bits of that slot are taken as a 16-bit
    instruction, with the format indicated by the first bit (as opposed to
    the usual 17th bit) of that instruction slot's contents.

    So it's possible to reserve an odd multiple of 16 bits for
    pseudo-immediates, so as to avoid waste.

    Case II:

    Instructions longer than 32 bits are specified by being of the form:

    The first instruction slot:

    11111
    00
    (3 bits) length in instruction slots, from 2 to 7
    (22 bits) rest of the first part of the instruction

    All remaining instruction slots:

    11111
    (3 bits) position within instruction, from 2 to 7
    (24 bits) rest of this part of the instruction

    This mechanism, however, will _also_ be used for VLIW functionality or
    prefix functionality which was formerly in block headers.

    In that case, the first instruction slot, and the remaining
    instruction slots, no longer need to be contiguous; instead, ordinary
32-bit instructions or pairs of 16-bit instructions can occur between
    the portions of the ensemble of prefixed instructions formed by this
    means.

    And there is a third improvement.

When Case I above is in effect, the block in which space for
pseudo-immediates is reserved will be stored in an internal register
    in the processor.

    Subsequent blocks can contain operate instructions with
    pseudo-immediate operands even if no space for pseudo-immediates is
    reserved in those blocks. In that case, the retained copy of the last
block encountered in which pseudo-immediates were reserved shall be
referenced instead.

    I think these changes will improve code density... or, at least, they
    will make it appear that no space is obviously forced to be wasted,
    even if no real improvement in code density results.

    The page has now been updated to reflect this modified design.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 22 19:36:54 2024
    From Newsgroup: comp.arch

    On Mon, 22 Apr 2024 16:22:11 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

On Mon, 22 Apr 2024 14:13:41 -0600, John Savard
<quadibloc@servername.invalid> wrote:

    The first instruction slot:

    11111
    00
    (3 bits) length in instruction slots, from 2 to 7
    (22 bits) rest of the first part of the instruction

    All remaining instruction slots:

    11111
    (3 bits) position within instruction, from 2 to 7
    (24 bits) rest of this part of the instruction

    The page has now been updated to reflect this modified design.

    And I thought I was on to something.

    The functionality - pseudo-immediates and VLIW features - was all the
    same, but now everything was so much simpler. The only thing that
    needed to be in a header, the three-bit field that reserved space for pseudo-immediates, now had just three bits of overhead.

    Everything else followed a normal instruction model, instead of a
    complicated header.

    But... if I use a header with 22 bits usable to turn instruction words
    that have 22 bits available...

    into instructions that are _longer_ than 32 bits...

    well, guess what?

    If I use half the opcode space for four-word instructions, then one
    header with 22 bits available can add 7 bits to each of three
    subsequent instructions.

    However, 24 plus 7 is 31.

    So I'm stuck at putting two instructions in three words even for a
    modest extension of the instruction set...

    never mind adding a whole bunch of bits for stuff like predication!

    I can tease out a couple of extra bits, so that I have a 22-bit
    starting word, but 26 bits in each following one, by replacing the
    three bit "position" field with a field that just contains 0 in every instruction slot but the last one, indicated with a 1.

    With 26 bits, to get 33 bits - all I need for a nice expansion of the instruction set to its "full" form - I need to add seven bits to each
    one, so that now does allow one starting word to prefix three
    instructions.

    Still not great, but adequate. And the first word doesn't really need
    a length field either, it just needs to indicate it's the first one.
    Which is how I had worked something like this before.
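The bit budget works out as described:

```python
# Checking the budget directly: a 22-bit starting word can top up
# following 26-bit slots to the 33 bits the "full" instruction format
# needs, i.e. 7 extra bits per instruction, so one starting word
# covers three instructions (with one bit left over).

head_bits = 22
slot_bits = 26
full_insn_bits = 33
extra = full_insn_bits - slot_bits   # 7 bits borrowed per instruction
prefixed = head_bits // extra        # instructions covered by one head
```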

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 23 01:53:26 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sat, 20 Apr 2024 17:07:11 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    And, hey, I'm not the first guy to get sunk because of forgetting what
    lies under the tip of the iceberg that's above the water.

    That also happened to the captain of the _Titanic_.

    Concer-tina-tanic !?!

    Oh, dear. This discussion has inspired me to rework the basic design
    of Concertina II _yet again_!

    I suggest it is time for Concertina III.......

    The new design, not yet online, will have the following features:

    The code stream will continue to be divided into 256-bit blocks.

    Why not a whole cache line ??

However, block headers will be eliminated. Instead, this functionality
    will be subsumed into the instruction set.

    Case I:

    Indicating that from 1 to 7 32-bit instruction slots in a block are
    not used for instructions, but instead may contain pseudo-immediates
    will be achieved by:

    Placing a two-address register-to-register operate instruction in the
    first instruction slot in a block. These instructions will have a
    three-bit field which, if nonzero, indicates the amount of space
    reserved.

    To avoid waste, when such an instruction is present in any slot other
    than the first, that field will have the following function:

    If nonzero, it points to an instruction slot (slots 1 through 7, in
    the second through eighth positions) and a duplicate copy of the
    instruction in that slot will be placed in the instruction stream
    immediately following the instruction with that field.

    The following special conditions apply:

    If the instruction slot contains a pair of 16-bit instructions, only
    the first of those instructions is so inserted for execution.

    The instruction slot may not be one that is reserved for
    pseudo-immediates, except that it may be the _first_ such slot, in
    which case, the first 16 bits of that slot are taken as a 16-bit
    instruction, with the format indicated by the first bit (as opposed to
    the usual 17th bit) of that instruction slot's contents.

    So it's possible to reserve an odd multiple of 16 bits for
    pseudo-immediates, so as to avoid waste.

    Case II:

    Instructions longer than 32 bits are specified by being of the form:

    The first instruction slot:

    11111
    00
    (3 bits) length in instruction slots, from 2 to 7
    (22 bits) rest of the first part of the instruction

    All remaining instruction slots:

    11111
    (3 bits) position within instruction, from 2 to 7
    (24 bits) rest of this part of the instruction

    This mechanism, however, will _also_ be used for VLIW functionality or
    prefix functionality which was formerly in block headers.

    In that case, the first instruction slot, and the remaining
    instruction slots, no longer need to be contiguous; instead, ordinary
32-bit instructions or pairs of 16-bit instructions can occur between
    the portions of the ensemble of prefixed instructions formed by this
    means.

    And there is a third improvement.

    When Case I above is in effect, the block in which space for pseudo-immediates is reserved will be stored in an internal register
    in the processor.

    Subsequent blocks can contain operate instructions with
    pseudo-immediate operands even if no space for pseudo-immediates is
    reserved in those blocks. In that case, the retained copy of the last
    block encountered in which pseudo-immediates were reserved shall be referenced instead.

    I think these changes will improve code density... or, at least, they
    will make it appear that no space is obviously forced to be wasted,
    even if no real improvement in code density results.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 22 20:19:36 2024
    From Newsgroup: comp.arch

    On Tue, 23 Apr 2024 01:53:26 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I suggest it is time for Concertina III.......

    If the old Concertina II were worth keeping...

    Why not a whole cache line ??

    That would be one way to allow the overhead of a block prefix to be
    minimized.

    But that starts to look like just having mode bits for an entire
    program.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 22 20:22:12 2024
    From Newsgroup: comp.arch

    On Mon, 22 Apr 2024 19:36:54 -0600, John Savard
    <quadibloc@servername.invalid> wrote:


    I can tease out a couple of extra bits, so that I have a 22-bit
    starting word, but 26 bits in each following one, by replacing the
three bit "position" field with a field that just contains 0 in every
instruction slot but the last one, indicated with a 1.

With 26 bits, to get 33 bits - all I need for a nice expansion of the
instruction set to its "full" form - I need to add seven bits to each
    one, so that now does allow one starting word to prefix three
    instructions.

    Still not great, but adequate. And the first word doesn't really need
    a length field either, it just needs to indicate it's the first one.
    Which is how I had worked something like this before.

    But fully half the opcode space is allocated to 16-bit instructions.
Even though that half doesn't really play nice with other things, it's
    too tempting a target to ignore. But the price would be losing the
    fully parallel nature of decoding.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From George Neuner@gneuner2@comcast.net to comp.arch on Mon Apr 22 23:09:43 2024
    From Newsgroup: comp.arch

    On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Address arithmetic is ADD only and does not care about signs or
    overflow. There is no concept of a negative base register or a
    negative index register (or, for that matter, a negative displace-
    ment), overflow, underflow, carry, ...

    Stack frame pointers often point to the middle of the frame and need
    to access data using both positive and negative displacements.

    Some GC schemes use negative displacements to access object headers.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Tue Apr 23 00:54:00 2024
    From Newsgroup: comp.arch

    On Mon, 22 Apr 2024 20:22:12 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    But fully half the opcode space is allocated to 16-bit instructions.
Even though that half doesn't really play nice with other things, it's
    too tempting a target to ignore. But the price would be losing the
    fully parallel nature of decoding.

    After heading out to buy groceries, my head cleared enough to discard
    the various complicated and bizarre schemes I was considering to deal
    with the issue, and instead to drastically reduce the overhead for the instructions longer than 32 bits, now that this had become a major
concern due to also using this format for prefixed instructions as
    well, in a simple and straightforward manner.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 23 17:58:41 2024
    From Newsgroup: comp.arch

    George Neuner wrote:

    On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Address arithmetic is ADD only and does not care about signs or
    overflow. There is no concept of a negative base register or a
    negative index register (or, for that matter, a negative displace-
    ment), overflow, underflow, carry, ...

    Stack frame pointers often point to the middle of the frame and need
    to access data using both positive and negative displacements.

    Yes, one accesses callee saved registers with positive displacements
    and local variables with negative accesses. One simply needs to know
where the former stops and the latter begins. ENTER and EXIT know this
    by the register count and by the stack allocation size.
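A loose model of that frame layout (word size and offsets are illustrative, not the actual My 66000 ENTER/EXIT encoding):

```python
# Saved registers sit at positive offsets from the frame pointer,
# locals at negative ones; the register count and allocation size
# mark the boundary between the two regions.

def frame_offsets(n_saved: int, local_bytes: int, word: int = 8):
    saved = {f"r{i}": i * word for i in range(n_saved)}      # FP+0, +8, ...
    locals_ = {f"local{i}": -(i + 1) * word
               for i in range(local_bytes // word)}          # FP-8, -16, ...
    return saved, locals_

saved, locals_ = frame_offsets(3, 16)
```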

    Some GC schemes use negative displacements to access object headers.

    Those are negative displacements not negative bases or indexes.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 23 16:53:16 2024
    From Newsgroup: comp.arch

    On 4/23/2024 1:54 AM, John Savard wrote:
    On Mon, 22 Apr 2024 20:22:12 -0600, John Savard <quadibloc@servername.invalid> wrote:

    But fully half the opcode space is allocated to 16-bit instructions.
Even though that half doesn't really play nice with other things, it's
    too tempting a target to ignore. But the price would be losing the
    fully parallel nature of decoding.

    After heading out to buy groceries, my head cleared enough to discard
    the various complicated and bizarre schemes I was considering to deal
    with the issue, and instead to drastically reduce the overhead for the instructions longer than 32 bits, now that this had become a major
concern due to also using this format for prefixed instructions as
    well, in a simple and straightforward manner.


    You know, one could just be like, say:
    xxxx-xxxx-xxxx-xxx0 //16-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-xx01 //32-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-x011 //32-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-0111 //32-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-1111 //jumbo prefix (64+)

    And call it "good enough"...


    Then, say (6b registers):
    zzzz-mmmm-nnnn-zzz0 //16-bit op (2R)
    zzzz-tttt-ttss-ssss nnnn-nnpp-zzzz-xxx1 //32-bit op (3R)
    iiii-iiii-iiss-ssss nnnn-nnpp-zzzz-xxx1 //32-bit op (3RI, Imm10)
    iiii-iiii-iiii-iiii nnnn-nnpp-zzzz-xxx1 //32-bit op (2RI, Imm16)
    iiii-iiii-iiii-iiii iiii-iipp-zzzz-xxx1 //32-bit op (Branch)


    Or (5b registers):
    zzzz-mmmm-nnnn-zzz0 //16-bit op (2R)
    zzzz-zttt-ttzs-ssss nnnn-nzpp-zzzz-xxx1 //32-bit op (3R)
    iiii-iiii-iiis-ssss nnnn-nzpp-zzzz-xxx1 //32-bit op (3RI, Imm11)
    iiii-iiii-iiii-iiii nnnn-nzpp-zzzz-xxx1 //32-bit op (2RI, Imm16)
    iiii-iiii-iiii-iiii iiii-iipp-zzzz-xxx1 //32-bit op (Branch)

    ...
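The low-bit length coding above decodes with just two tests; this sketch assumes exactly the bit patterns listed:

```python
# Bit 0 clear: 16-bit op; low nibble 1111: jumbo prefix (64 bits or
# more); anything else ending in 1 is one of the 32-bit forms.

def insn_length(halfword: int) -> int:
    if (halfword & 0b0001) == 0:
        return 16                 # xxxx-xxxx-xxxx-xxx0
    if (halfword & 0b1111) == 0b1111:
        return 64                 # jumbo prefix, 64 bits or more
    return 32                     # the ...01 / ...011 / ...0111 forms
```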



    John Savard

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 23 22:55:50 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/23/2024 1:54 AM, John Savard wrote:
    On Mon, 22 Apr 2024 20:22:12 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    But fully half the opcode space is allocated to 16-bit instructions.
Even though that half doesn't really play nice with other things, it's
    too tempting a target to ignore. But the price would be losing the
    fully parallel nature of decoding.

    After heading out to buy groceries, my head cleared enough to discard
    the various complicated and bizarre schemes I was considering to deal
    with the issue, and instead to drastically reduce the overhead for the
    instructions longer than 32 bits, now that this had become a major
concern due to also using this format for prefixed instructions as
    well, in a simple and straightforward manner.


    You know, one could just be like, say:
    xxxx-xxxx-xxxx-xxx0 //16-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-xx01 //32-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-x011 //32-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-0111 //32-bit op
    xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-1111 //jumbo prefix (64+)

    And call it "good enough"...


    Then, say (6b registers):
    zzzz-mmmm-nnnn-zzz0 //16-bit op (2R)
    zzzz-tttt-ttss-ssss nnnn-nnpp-zzzz-xxx1 //32-bit op (3R)
    iiii-iiii-iiss-ssss nnnn-nnpp-zzzz-xxx1 //32-bit op (3RI, Imm10)
    iiii-iiii-iiii-iiii nnnn-nnpp-zzzz-xxx1 //32-bit op (2RI, Imm16)
    iiii-iiii-iiii-iiii iiii-iipp-zzzz-xxx1 //32-bit op (Branch)


    Or (5b registers):
    zzzz-mmmm-nnnn-zzz0 //16-bit op (2R)
    zzzz-zttt-ttzs-ssss nnnn-nzpp-zzzz-xxx1 //32-bit op (3R)
    iiii-iiii-iiis-ssss nnnn-nzpp-zzzz-xxx1 //32-bit op (3RI, Imm11)
    iiii-iiii-iiii-iiii nnnn-nzpp-zzzz-xxx1 //32-bit op (2RI, Imm16)
    iiii-iiii-iiii-iiii iiii-iipp-zzzz-xxx1 //32-bit op (Branch)

    ....

    Punt on the 16-bit instructions::

    000110 CONDI rrrrr PRED xxthen xxelse
    000111 ddddd rrrrr SHF rwidth roffst
    001001 ddddd rrrrr DscLd MemOp rrrrr
    001010 ddddd rrrrr I12Sd 2-OPR rrrrr
    001100 ddddd rrrrr I12 3OP rrrrr rrrrr
    001101 ddddd rrrrr I12Sd 1-OPERA TIONx
    01100 bitnum rrrrr 16-bit-displacement
    011010 CONDI rrrrr 16-bit-displacement
    011110 26-bit-displacementtttttttttttt
    011111 26-bit-displacementtttttttttttt
    100000
    to
    101110 ddddd rrrrr 16-bit-displacement
    110000
    to
    111100 ddddd rrrrr 16-bit-immediateeee


    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Thu Apr 25 01:00:21 2024
    From Newsgroup: comp.arch

    On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instructions, that meant that
    when you wrote code to operate on unsigned values, you had to remember
    that the normal names of the condition code values were oriented around signed arithmetic.

    I thought architectures typically had separate condition codes for “carry” versus “overflow”. That way, you didn’t need signed versus unsigned versions of add, subtract and compare; it was just a matter of looking at
    the right condition codes on the result.
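The point can be illustrated with a small flag model (8-bit registers, ARM-style convention where carry set means "no borrow"):

```python
# One subtract sets both carry (unsigned borrow) and overflow (signed
# wrap); the consumer picks which flags to test, so no separate
# signed and unsigned compare instructions are needed.

def sub_flags(a: int, b: int):
    r = (a - b) & 0xFF
    n = (r >> 7) & 1                   # negative (sign bit of result)
    z = int(r == 0)                    # zero
    c = int(a >= b)                    # carry: no unsigned borrow
    v = int((a >> 7) != (b >> 7) and (r >> 7) != (a >> 7))  # overflow
    return r, n, z, c, v

def unsigned_lt(a, b):
    return sub_flags(a, b)[3] == 0     # borrow occurred (C clear)

def signed_lt(a, b):
    _, n, _, _, v = sub_flags(a, b)
    return n != v                      # the usual N != V condition
```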
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 25 02:50:09 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro wrote:

    On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instructions, that meant that
    when you wrote code to operate on unsigned values, you had to remember
    that the normal names of the condition code values were oriented around
    signed arithmetic.

    I thought architectures typically had separate condition codes for “carry”
    versus “overflow”. That way, you didn’t need signed versus unsigned versions of add, subtract and compare; it was just a matter of looking at the right condition codes on the result.

    Maybe now with 4-or-5-bit condition codes yes,
    But the early machines (360) with 2-bit codes were already constricted.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Thu Apr 25 03:28:36 2024
    From Newsgroup: comp.arch

    On Thu, 25 Apr 2024 02:50:09 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instructions, that meant
    that when you wrote code to operate on unsigned values, you had to
    remember that the normal names of the condition code values were
    oriented around signed arithmetic.

    I thought architectures typically had separate condition codes for
“carry” versus “overflow”. That way, you didn’t need signed versus
unsigned versions of add, subtract and compare; it was just a matter of
    looking at the right condition codes on the result.

    Maybe now with 4-or-5-bit condition codes yes,
    But the early machines (360) with 2-bit codes were already constricted.

    The DEC PDP-6, from around 1964, same time as the System/360, had separate carry and overflow flags.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 24 23:12:01 2024
    From Newsgroup: comp.arch

    On Thu, 25 Apr 2024 01:00:21 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:
    On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instructions, that meant that
    when you wrote code to operate on unsigned values, you had to remember
    that the normal names of the condition code values were oriented around
    signed arithmetic.

I thought architectures typically had separate condition codes for carry
versus overflow. That way, you didn't need signed versus unsigned
versions of add, subtract and compare; it was just a matter of looking at
the right condition codes on the result.

    Yes; I thought that was the same as what I just said.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From George Neuner@gneuner2@comcast.net to comp.arch on Thu Apr 25 17:01:55 2024
    From Newsgroup: comp.arch

    On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    George Neuner wrote:

    On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Address arithmetic is ADD only and does not care about signs or
    overflow. There is no concept of a negative base register or a
    negative index register (or, for that matter, a negative displace-
    ment), overflow, underflow, carry, ...

    Stack frame pointers often point to the middle of the frame and need
    to access data using both positive and negative displacements.

    Yes, one accesses callee saved registers with positive displacements
    and local variables with negative accesses. One simply needs to know
where the former stops and the latter begins. ENTER and EXIT know this
    by the register count and by the stack allocation size.

    Some GC schemes use negative displacements to access object headers.

    Those are negative displacements not negative bases or indexes.

    I was reacting to your message (quoted fully above) which,
    paraphrased, says "address arithmetic is add only and there is no
    concept of a negative displacement".

In one sense you are correct: the result of the calculation has to be
considered as unsigned in the range 0..max_memory ... i.e., there is no
concept of negative *address*.

    However, the components being added to form the address, I believe are
    a different matter.

    I agree that negative base is meaningless.

    However, negative index and negative displacement both do have
    meaning. The inclusion of specialized index registers is debatable
    [I'm in the GPR camp], but I do believe that index and displacement
    *values* both always should be considered as signed.

    YMMV.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 25 20:02:09 2024
    From Newsgroup: comp.arch

    On 4/25/2024 4:01 PM, George Neuner wrote:
    On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    George Neuner wrote:

    On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Address arithmetic is ADD only and does not care about signs or
    overflow. There is no concept of a negative base register or a
    negative index register (or, for that matter, a negative displace-
    ment), overflow, underflow, carry, ...

    Stack frame pointers often point to the middle of the frame and need
    to access data using both positive and negative displacements.

    Yes, one accesses callee saved registers with positive displacements
    and local variables with negative accesses. One simply needs to know
where the former stops and the latter begins. ENTER and EXIT know this
    by the register count and by the stack allocation size.

    Some GC schemes use negative displacements to access object headers.

    Those are negative displacements not negative bases or indexes.

    I was reacting to your message (quoted fully above) which,
    paraphrased, says "address arithmetic is add only and there is no
    concept of a negative displacement".

In one sense you are correct: the result of the calculation has to be
considered as unsigned in the range 0..max_memory ... i.e., there is no
concept of negative *address*.

    However, the components being added to form the address, I believe are
    a different matter.

    I agree that negative base is meaningless.

    However, negative index and negative displacement both do have
    meaning. The inclusion of specialized index registers is debatable
    [I'm in the GPR camp], but I do believe that index and displacement
    *values* both always should be considered as signed.


    Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be significantly
more common than negative ones. Whether or not it makes sense to have
negative displacements depends mostly on whether more than half of the
displacements that miss the encodable range are negative.

    From what I can tell, this seems to be:
    ~ 10 bits, scaled.
    ~ 13 bits, unscaled.


So, say, an ISA like RISC-V might have had a slightly higher hit rate
with unsigned displacements than with signed displacements, but if one
added 1 or 2 bits, signed would have still been a clear winner (or, with
1 or 2 fewer bits, unsigned a clear winner).

    I ended up going with signed displacements for XG2, but it was pretty
    close to break-even in this case (when expanding from the 9-bit unsigned displacements in Baseline).


    Granted, all signed or all-unsigned might be better from an ISA design consistency POV.


    If one had 16-bit displacements, then unscaled displacements would make
    sense; otherwise scaled displacements seem like a win (misaligned displacements being much less common than aligned displacements).

    But, admittedly, main reason I went with unscaled for GBR-rel and PC-rel Load/Store, was because using scaled displacements here would have
    required more relocation types (nevermind if the hit rate for unscaled
    9-bit displacements is "pretty weak").

    Though, did end up later adding specialized Scaled GBR-Rel Load/Store
    ops (to improve code density), so it might have been better in
    retrospect had I instead just went the "keep it scaled and add more
    reloc types to compensate" option.


    ...


    YMMV.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 26 13:25:03 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/25/2024 4:01 PM, George Neuner wrote:
    On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:


    Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be significantly
more common than negative ones. Whether or not it makes sense to have
negative displacements depends mostly on whether more than half of the
displacements that miss the encodable range are negative.

    From what I can tell, this seems to be:
    ~ 10 bits, scaled.
    ~ 13 bits, unscaled.


So, say, an ISA like RISC-V might have had a slightly higher hit rate
with unsigned displacements than with signed displacements, but if one
added 1 or 2 bits, signed would have still been a clear winner (or, with
1 or 2 fewer bits, unsigned a clear winner).

    I ended up going with signed displacements for XG2, but it was pretty
    close to break-even in this case (when expanding from the 9-bit unsigned displacements in Baseline).


    Granted, all signed or all-unsigned might be better from an ISA design consistency POV.


    If one had 16-bit displacements, then unscaled displacements would make sense; otherwise scaled displacements seem like a win (misaligned displacements being much less common than aligned displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

    But, admittedly, main reason I went with unscaled for GBR-rel and PC-rel Load/Store, was because using scaled displacements here would have
    required more relocation types (nevermind if the hit rate for unscaled
    9-bit displacements is "pretty weak").

    Though, did end up later adding specialized Scaled GBR-Rel Load/Store
    ops (to improve code density), so it might have been better in
    retrospect had I instead just went the "keep it scaled and add more
    reloc types to compensate" option.


    ....


    YMMV.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Apr 26 15:34:57 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    What are these funny numbers about?

    Do you mean that you want number ranges like -11468..54067 (82.5%
    positive) or -5734..59801 (91.25% positive)? Which one of those? And
    why not, say -8192..57343 (87.5% positive)?

    How does one use a frame pointer without negative displacements ??

    You let it point to the lowest address you want to access. That moves
    the problem to unwinding frame pointer chains where the unwinder does
    not know the frame-specific difference between the frame pointer and
    the pointer of the next frame.

    An alternative is to have a frame-independent difference that leaves
    enough room that, say 90% (or 99%, or whatever) of the frames don't
    need negative offsets from that frame.

    Likewise, if you have signed displacements, and are unhappy about the
    skewed usage, you can let the frame pointer point at an offset from
the pointer to the next frame such that the usage is less skewed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 26 12:30:24 2024
    From Newsgroup: comp.arch

    On 4/26/2024 8:25 AM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/25/2024 4:01 PM, George Neuner wrote:
    On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:


    Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be significantly
more common than negative ones. Whether or not it makes sense to have
negative displacements depends mostly on whether more than half of the
displacements that miss the encodable range are negative.

     From what I can tell, this seems to be:
       ~ 10 bits, scaled.
       ~ 13 bits, unscaled.


So, say, an ISA like RISC-V might have had a slightly higher hit rate
with unsigned displacements than with signed displacements, but if one
added 1 or 2 bits, signed would have still been a clear winner (or,
with 1 or 2 fewer bits, unsigned a clear winner).

    I ended up going with signed displacements for XG2, but it was pretty
    close to break-even in this case (when expanding from the 9-bit
    unsigned displacements in Baseline).


    Granted, all signed or all-unsigned might be better from an ISA design
    consistency POV.


    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win (misaligned
    displacements being much less common than aligned displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.


    I was seeing stats more like 99.8% positive, 0.2% negative.


There was enough of a bias that, below 10 bits, if one takes all the
remaining cases, zero extending would always win; only on reaching 10
bits does the share of missed displacements that are negative reach 50%
(along with positive displacements larger than 512).

    So, one can make a choice: -512..511, or 0..1023, ...

    In XG2, I ended up with -512..511, for pros or cons (for some programs,
    this choice is optimal, for others it is not).

    Where, when scaled for QWORD, this is +/- 4K.


    If one had a 16-bit displacement, it would be a choice between +/- 32K,
    or (scaled) +/- 256K, or 0..512K, ...

    For the special purpose "LEA.Q (GBR, Disp16), Rn" instruction, I ended
    up going unsigned, where for a lot of the programs I am dealing with,
    this is big enough to cover ".data" and part of ".bss", generally used
    for arrays which need the larger displacements (the compiler lays things
    out so that most of the commonly used variables are closer to the start
    of ".data", so can use smaller displacements).

    Does implicitly require that all non-trivial global arrays have at least 64-bit alignment.


    Note that seemingly both BGBCC and GCC have variations on this
    optimization, though in GCC's case it requires special command-line
    options ("-fdata-sections", etc).



    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values


    In my case, all of these are [SP+Disp], granted, there is no frame
    pointer and stack frames are fixed-size in BGBCC.

    This is typically with a frame layout like:
    Argument/Spill space
    -- Frame Top
    Register Save
    (Stack Canary)
    Local arrays/structs
    Local variables
    Argument/Spill Space
    -- Frame Bottom

    Contrast with traditional x86 layout, which puts saved registers and
    local variables near the frame-pointer, which points near the top of the
    stack frame.

Though, in a majority of functions, the MOV.L and MOV.Q instructions have a
    big enough displacement to cover the whole frame (excludes functions
    which have a lot of local arrays or similar, though overly large local
    arrays are auto-folded to using heap allocation, but at present this
    logic is based on the size of individual arrays rather than on the total combined size of the stack frame).


    Adding a frame pointer (with negative displacements) wouldn't make a big difference in XG2 Mode, but would be more of an issue for (pure)
    Baseline, where options are either to load the displacement into a
    register, or use a jumbo prefix.


    But, admittedly, main reason I went with unscaled for GBR-rel and
    PC-rel Load/Store, was because using scaled displacements here would
    have required more relocation types (nevermind if the hit rate for
    unscaled 9-bit displacements is "pretty weak").

    Though, did end up later adding specialized Scaled GBR-Rel Load/Store
    ops (to improve code density), so it might have been better in
    retrospect had I instead just went the "keep it scaled and add more
    reloc types to compensate" option.


    ....


    YMMV.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Apr 26 14:59:43 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    BGB wrote:

    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win (misaligned
    displacements being much less common than aligned displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

A sign-extended 16-bit offset would cover almost all such access needs
so I really don't see the need for funny business.

But if you really want a skewed offset range, it could use something like
excess-256 encoding, which zero-extends the immediate and then subtracts
256 (or whatever) from it, to give offsets in the range -256..+65535-256.
So an immediate value of 0 equals an offset of -256.




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 26 21:01:35 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    What are these funny numbers about?

    In typical usages in MY 66000 ISA <only> one needs only 18 DW of
    negative addressing and we have a 16-bit displacement. So, technically
    it might get by at the 99% level with -32..+65500. Other usages might
    need a few more on the negative end of things so 1/8..1/16 in the
    negative direction, 7/8..15/16 in the positive.

    Do you mean that you want number ranges like -11468..54067 (82.5%
    positive) or -5734..59801 (91.25% positive)? Which one of those? And
    why not, say -8192..57343 (87.5% positive)?

    Roughly.

    How does one use a frame pointer without negative displacements ??

    You let it point to the lowest address you want to access. That moves
    the problem to unwinding frame pointer chains where the unwinder does
    not know the frame-specific difference between the frame pointer and
    the pointer of the next frame.

    An alternative is to have a frame-independent difference that leaves
    enough room that, say 90% (or 99%, or whatever) of the frames don't
    need negative offsets from that frame.

    Likewise, if you have signed displacements, and are unhappy about the
    skewed usage, you can let the frame pointer point at an offset from
the pointer to the next frame such that the usage is less skewed.

    Such a hassle....

    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 26 21:07:24 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/26/2024 8:25 AM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/25/2024 4:01 PM, George Neuner wrote:
    On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:


    Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be significantly
more common than negative ones. Whether or not it makes sense to have
negative displacements depends mostly on whether more than half of the
displacements that miss the encodable range are negative.

     From what I can tell, this seems to be:
       ~ 10 bits, scaled.
       ~ 13 bits, unscaled.


So, say, an ISA like RISC-V might have had a slightly higher hit rate
with unsigned displacements than with signed displacements, but if one
added 1 or 2 bits, signed would have still been a clear winner (or,
with 1 or 2 fewer bits, unsigned a clear winner).

    I ended up going with signed displacements for XG2, but it was pretty
    close to break-even in this case (when expanding from the 9-bit
    unsigned displacements in Baseline).


    Granted, all signed or all-unsigned might be better from an ISA design
    consistency POV.


    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win (misaligned
    displacements being much less common than aligned displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.


    I was seeing stats more like 99.8% positive, 0.2% negative.

    After pulling out the calculator and thinking about the frames, My
    66000 needs no more than 18 DW of negative addressing. This is just
    over 0.2% as you indicate.


There was enough of a bias that, below 10 bits, if one takes all the
remaining cases, zero extending would always win; only on reaching 10
bits does the share of missed displacements that are negative reach 50%
(along with positive displacements larger than 512).

    So, one can make a choice: -512..511, or 0..1023, ...

    In XG2, I ended up with -512..511, for pros or cons (for some programs,
    this choice is optimal, for others it is not).

    Where, when scaled for QWORD, this is +/- 4K.


    If one had a 16-bit displacement, it would be a choice between +/- 32K,
    or (scaled) +/- 256K, or 0..512K, ...

    We looked at this in Mc88100 (scaling of the displacement). The drawback
    was that the ISA and linker were slightly mismatched: The linker wanted
    to use a single upper 16-bit LUI <if it were> over several LD/STs of potentially different sizes, and scaling of the displacement failed in
    those regards; so we dropped scaled displacements.

    For the special purpose "LEA.Q (GBR, Disp16), Rn" instruction, I ended
    up going unsigned, where for a lot of the programs I am dealing with,
    this is big enough to cover ".data" and part of ".bss", generally used
    for arrays which need the larger displacements (the compiler lays things
    out so that most of the commonly used variables are closer to the start
    of ".data", so can use smaller displacements).

    Not even an issue when one has universal constants.


    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values


    In my case, all of these are [SP+Disp], granted, there is no frame
    pointer and stack frames are fixed-size in BGBCC.

    This is typically with a frame layout like:
    Argument/Spill space
    -- Frame Top
    Register Save
    (Stack Canary)
    Local arrays/structs
    Local variables
    Argument/Spill Space
    -- Frame Bottom

    Contrast with traditional x86 layout, which puts saved registers and
    local variables near the frame-pointer, which points near the top of the stack frame.

Though, in a majority of functions, the MOV.L and MOV.Q instructions have a
    big enough displacement to cover the whole frame (excludes functions
    which have a lot of local arrays or similar, though overly large local arrays are auto-folded to using heap allocation, but at present this
    logic is based on the size of individual arrays rather than on the total combined size of the stack frame).


    Adding a frame pointer (with negative displacements) wouldn't make a big difference in XG2 Mode, but would be more of an issue for (pure)
    Baseline, where options are either to load the displacement into a
    register, or use a jumbo prefix.


    But, admittedly, main reason I went with unscaled for GBR-rel and
    PC-rel Load/Store, was because using scaled displacements here would
    have required more relocation types (nevermind if the hit rate for
    unscaled 9-bit displacements is "pretty weak").

    Though, did end up later adding specialized Scaled GBR-Rel Load/Store
    ops (to improve code density), so it might have been better in
    retrospect had I instead just went the "keep it scaled and add more
    reloc types to compensate" option.


    ....


    YMMV.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 26 21:16:28 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/26/2024 8:25 AM, MitchAlsup1 wrote:


    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values


    In my case, all of these are [SP+Disp], granted, there is no frame
    pointer and stack frames are fixed-size in BGBCC.

    I only have FP when the base language is block structured and scoped.
    Not C, C++ or FORTRAN, but Algol, ADA, Pascal: Yes.

    This is typically with a frame layout like:
    Argument/Spill space
    -- Frame Top
    Register Save
    (Stack Canary)
    Local arrays/structs
    Local variables
    Argument/Spill Space
    -- Frame Bottom

    Previous Argument/Result space
    { Register Save area
    Return Pointer }
    Local Descriptors -------------------\
    Local Variables |
    Dynamically allocated Stack space <--/
    My Argument/Result space

    When safe stack is in use, Register Save area and return pointer are
    placed on a separate stack not accessible with LD/ST instructions.

    Contrast with traditional x86 layout, which puts saved registers and
    local variables near the frame-pointer, which points near the top of the stack frame.

Though, in a majority of functions, the MOV.L and MOV.Q instructions have a
    big enough displacement to cover the whole frame (excludes functions
    which have a lot of local arrays or similar, though overly large local arrays are auto-folded to using heap allocation, but at present this
    logic is based on the size of individual arrays rather than on the total combined size of the stack frame).

    By making a Local Descriptor area on the stack, one can access the
    descriptors off of FP and access the dynamic stuff via that pointer.
    Both Local Descriptors and Local Variables may be allocated into
    registers and not actually exist on the stack.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 27 14:58:52 2024
    From Newsgroup: comp.arch

    On 4/26/2024 4:07 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/26/2024 8:25 AM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/25/2024 4:01 PM, George Neuner wrote:
On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:


    Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be
significantly more common than negative ones. Whether or not it
makes sense to have negative displacements depends mostly on whether
more than half of the displacements that miss the encodable range
are negative.

     From what I can tell, this seems to be:
       ~ 10 bits, scaled.
       ~ 13 bits, unscaled.


So, say, an ISA like RISC-V might have had a slightly higher hit rate
with unsigned displacements than with signed displacements, but if one
added 1 or 2 bits, signed would have still been a clear winner (or,
with 1 or 2 fewer bits, unsigned a clear winner).

    I ended up going with signed displacements for XG2, but it was
    pretty close to break-even in this case (when expanding from the
    9-bit unsigned displacements in Baseline).


    Granted, all signed or all-unsigned might be better from an ISA
    design consistency POV.


    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win
    (misaligned displacements being much less common than aligned
    displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.


    I was seeing stats more like 99.8% positive, 0.2% negative.

    After pulling out the calculator and thinking about the frames, My 66000 needs no more than 18 DW of negative addressing. This is just
    over 0.2% as you indicate.


    OK.


Not entirely sure I know what you mean by 'DW' here though...



There was enough of a bias that, below 10 bits, if one takes all the
remaining cases, zero extending would always win; only on reaching 10
bits does the share of missed displacements that are negative reach 50%
(along with positive displacements larger than 512).

    So, one can make a choice: -512..511, or 0..1023, ...

    In XG2, I ended up with -512..511, for pros or cons (for some
    programs, this choice is optimal, for others it is not).

    Where, when scaled for QWORD, this is +/- 4K.


    If one had a 16-bit displacement, it would be a choice between +/-
    32K, or (scaled) +/- 256K, or 0..512K, ...

    We looked at this in Mc88100 (scaling of the displacement). The drawback
    was that the ISA and linker were slightly mismatched: The linker wanted
    to use a single upper 16-bit LUI <if it were> over several LD/STs of potentially different sizes, and scaling of the displacement failed in
    those regards; so we dropped scaled displacements.


    This is partly why I initially went with unscaled for PC-rel and GBR-rel cases, since these were being used for globals, and the linker/reloc
    stage would need to deal with more complexity for the relocs in this case.

    For normal direct displacements, these will not typically be used for accessing globals or similar, or otherwise need relocs, so scaled made
    more sense here.


    Some scaled cases (for GBR) were later re-added mostly as an
    optimization case.

    And, when generating the binary all at once, it is possible to cluster
    the commonly used globals close together rather than have them scattered
    all across ".data" and ".bss" (so, most of the further reaches of these sections can be all the bulk arrays and similar).


    If doing separate compilation, likely the compiler would need to use the generic case (or possibly involve the arcane magic known as "linker relaxation").


    For the special purpose "LEA.Q (GBR, Disp16), Rn" instruction, I ended
    up going unsigned, where for a lot of the programs I am dealing with,
    this is big enough to cover ".data" and part of ".bss", generally used
    for arrays which need the larger displacements (the compiler lays
    things out so that most of the commonly used variables are closer to
    the start of ".data", so can use smaller displacements).

    Not even an issue when one has universal constants.


    In this case:
    LEA.B (GBR, Disp33s), Rn
    Needs 64 bits to encode, whereas:
    LEA.Q (GBR, Disp16u), Rn
    Can be encoded in 32 bits (but only applicable if the array is within
    512K of GBR).

    This saves some space for loading the address of a global array.

    Though, as-is, string literals still always need the longer form:
    LEA.B (PC, Disp33s), Rn
    ...

    Similar also applies for function pointers.



    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values


    In my case, all of these are [SP+Disp], granted, there is no frame
    pointer and stack frames are fixed-size in BGBCC.

    This is typically with a frame layout like:
       Argument/Spill space
       -- Frame Top
       Register Save
       (Stack Canary)
       Local arrays/structs
       Local variables
       Argument/Spill Space
       -- Frame Bottom

    Contrast with traditional x86 layout, which puts saved registers and
    local variables near the frame-pointer, which points near the top of
    the stack frame.

Though, in a majority of functions, the MOV.L and MOV.Q instructions have
    a big enough displacement to cover the whole frame (excludes functions
    which have a lot of local arrays or similar, though overly large local
    arrays are auto-folded to using heap allocation, but at present this
    logic is based on the size of individual arrays rather than on the
    total combined size of the stack frame).


    Adding a frame pointer (with negative displacements) wouldn't make a
    big difference in XG2 Mode, but would be more of an issue for (pure)
    Baseline, where options are either to load the displacement into a
    register, or use a jumbo prefix.


    But, admittedly, main reason I went with unscaled for GBR-rel and
    PC-rel Load/Store, was because using scaled displacements here would
    have required more relocation types (nevermind if the hit rate for
    unscaled 9-bit displacements is "pretty weak").

    Though, did end up later adding specialized Scaled GBR-Rel
    Load/Store ops (to improve code density), so it might have been
    better in retrospect had I instead just went the "keep it scaled and
    add more reloc types to compensate" option.


    ....


    YMMV.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 27 14:59:30 2024
    From Newsgroup: comp.arch

    On 4/26/2024 1:59 PM, EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win
    (misaligned displacements being much less common than aligned
    displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

    Sign-extended 16-bit offsets would cover almost all such access needs,
    so I really don't see the need for funny business.

    But if you really want a skewed offset range, it could use something
    like excess-256 encoding, which zero-extends the immediate and then
    subtracts 256 (or whatever) from it, giving offsets in the range
    -256..+65535-256. So an immediate value of 0 equals an offset of -256.


    Yeah, my thinking was that by the time one has 16 bits for Load/Store displacements, they could almost just go +/- 32K and call it done.

    But, for displacements much smaller than this, there is an advantage
    to scaling them.




    In other news, got around to getting the RISC-V code to build in PIE
    mode for Doom (by using "riscv64-unknown-linux-gnu-*").

    Can note that RV64 code density takes a hit in this case:
    RV64: 299K (.text)
    XG2 : 284K (.text)

    So, apparently using this version of GCC and using "-fPIE" works in my
    favor regarding code density...


    I guess a question is what FDPIC would do if GCC supported it, since
    this would be the closest direct analog to my own ABI.


    I guess some people are dragging their feet on FDPIC, as there is some
    debate as to whether or not NOMMU makes sense for RISC-V, along with its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable images,
    this would technically eliminate the need for GBR reloading.

    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less space-efficient to
    use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions will
    never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).


    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 27 20:37:34 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/26/2024 1:59 PM, EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win
    (misaligned displacements being much less common than aligned
    displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

    A sign extended 16-bit offsets would cover almost all such access needs
    so I really don't see the need for funny business.

    But if you really want a skewed range offset it could use something like
    excess-256 encoding which zero extends the immediate then subtract 256
    (or whatever) from it, to give offsets in the range -256..+65535-256.
    So an immediate value of 0 equals an offset of -256.


    Yeah, my thinking was that by the time one has 16 bits for Load/Store displacements, they could almost just go +/- 32K and call it done.

    But, much smaller than this, there is an advantage to scaling the displacements.




    In other news, got around to getting the RISC-V code to build in PIE
    mode for Doom (by using "riscv64-unknown-linux-gnu-*").

    Can note that RV64 code density takes a hit in this case:
    RV64: 299K (.text)
    XG2 : 284K (.text)

    Is this indicative that your ISA and RISC-V are within spitting distance
    of each other in terms of the number of instructions in .text ?? or not ??

    So, apparently using this version of GCC and using "-fPIE" works in my
    favor regarding code density...


    I guess a question is what FDPIC would do if GCC supported it, since
    this would be the closest direct analog to my own ABI.

    What is FDPIC ?? Federal Deposit Processor Insurance Corporation ??
    Final Dopey Position Independent Code ??

    I guess some people are dragging their feet on FDPIC, as there is some debate as to whether or not NOMMU makes sense for RISC-V, along with its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable images,
    this would technically eliminate the need for GBR reloading.

    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less space-efficient to
    use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions will
    never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).

    Would you be willing to compile DOOM with Brian's LLVM compiler and
    show the results ??

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 27 18:44:08 2024
    From Newsgroup: comp.arch

    On 4/27/2024 3:37 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/26/2024 1:59 PM, EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win
    (misaligned displacements being much less common than aligned
    displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.

    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

    A sign extended 16-bit offsets would cover almost all such access needs
    so I really don't see the need for funny business.

    But if you really want a skewed range offset it could use something like
    excess-256 encoding which zero extends the immediate then subtract 256
    (or whatever) from it, to give offsets in the range -256..+65535-256.
    So an immediate value of 0 equals an offset of -256.


    Yeah, my thinking was that by the time one has 16 bits for Load/Store
    displacements, they could almost just go +/- 32K and call it done.

    But, much smaller than this, there is an advantage to scaling the
    displacements.




    In other news, got around to getting the RISC-V code to build in PIE
    mode for Doom (by using "riscv64-unknown-linux-gnu-*").

    Can note that RV64 code density takes a hit in this case:
       RV64: 299K (.text)
       XG2 : 284K (.text)

    Is this indicative that your ISA and RISC-V are within spitting distance
    of each other in terms of the number of instructions in .text ?? or not ??


    It would appear that, with my current compiler output, both BJX2-XG2 and RISC-V RV64G are within a few percent of each other...

    If adjusting for Jumbo prefixes (with the version that omits GBR reloads):
    XG2: 270K (-10K of Jumbo Prefixes)

    Implying RISC-V now has around 11% more instructions in this scenario.


    It also has an additional 20K of ".rodata" that is likely constants,
    which likely overlap significantly with the jumbo prefixes.


    So, apparently using this version of GCC and using "-fPIE" works in my
    favor regarding code density...


    I guess a question is what FDPIC would do if GCC supported it, since
    this would be the closest direct analog to my own ABI.

    What is FDPIC ?? Federal Deposit Processor Insurance   Corporation ??
                    Final   Dopey   Position  Independent Code ??


    Required a little digging: "Function Descriptor Position Independent Code".

    But, I think the main difference is that normal PIC does calls like:
    LD Rt, [GOT+Disp]
    BSR Rt

    Whereas, FDPIC was typically more like (pseudo ASM):
    MOV SavedGOT, GOT
    LEA Rt, [GOT+Disp]
    MOV GOT, [Rt+8]
    MOV Rt, [Rt+0]
    BSR Rt
    MOV GOT, SavedGOT


    But, in my case, noting that function calls tend to be more common than
    the functions themselves, and functions will know whether or not they
    need to access global variables or call other functions, ... it made
    more sense to move this logic into the callee.


    No official RISC-V FDPIC ABI that I am aware of, though some proposals
    did seem vaguely similar in some areas to what I was doing with PBO.

    Where, they were accessing globals like:
    LUI Xt, DispHi
    ADD Xt, Xt, DispLo
    ADD Xt, Xt, GP
    LD Xd, Xt, 0

    Granted, this is less efficient than, say:
    MOV.Q (GBR, Disp33s), Rd

    Though, people didn't really detail the call sequence or prolog/epilog sequences, so I'm less sure how this would work.


    Likely guess, something like:
    MV Xs, GP
    LUI Xt, DispHi
    ADD Xt, Xt, DispLo
    ADD Xt, Xt, GP
    LD GP, Xt, 8
    LD Xt, Xt, 0
    JALR LR, Xt, 0
    MV GP, Xs

    Well, unless they have a better way to pull this off...

    But, yeah, as far as I saw it, my "better solution" was to put this part
    into the callee.


    Main tradeoff with my design is:
    From any GBR, one needs to be able to get to every other GBR;
    We need to have a way to know which table entry to reload (not
    statically known at compile time).

    In my PBO ABI, this was accomplished by using base relocs (but, this is
    N/A for ELF, where PE/COFF style base relocs are not a thing).


    One other option might be to use a PC-relative load to load the index.
    Say:
    AUIPC Xs, DispHi //"__global_pbo_offset$" ?
    LD Xs, DispLo
    LD Xt, GP, 0 //get table of offsets
    ADD Xt, Xt, Xs
    LD GP, Xt, 0

    In this case, "__global_pbo_offset$" would be a magic constant variable
    that gets fixed up by the ELF loader.



    I guess some people are dragging their feet on FDPIC, as there is some
    debate as to whether or not NOMMU makes sense for RISC-V, along with
    its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable images,
    this would technically eliminate the need for GBR reloading.

    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less space-efficient to
    use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions will
    never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).

    Would you be willing to compile DOOM with Brian's LLVM compiler and
    show the results ??


    Will need to download and build this compiler...

    Might need to look into this.


    But, yeah, current standing for this is:
    XG2 : 280K (static linked, Modified PDPCLIB + TestKern)
    RV64G : 299K (static linked, Modified PDPCLIB + TestKern)
    X86-64: 288K ("gcc -O3", dynamically linked GLIBC)
    X64 : 1083K (VS2022, static linked MSVCRT)

    But, MSVC is an outlier here for just how bad it is on this front.

    To get more reference points, would need to install more compilers.

    Could have provided an ARM reference point, except that the compiler
    isn't compiling stuff at the moment (would need to beat on stuff a bit
    more to try to get it to build; appears to be trying to build with static-linked Newlib but is missing symbols, ...).




    But, yeah, for good comparison, one needs to have everything build with
    the same C library, etc.


    I am thinking it may be possible to save a little more space by folding
    some of the stuff for "va_start()" into an ASM blob (currently, a lot of
    stuff is folded off into the function prolog, but probably doesn't need
    to be done inline for every varargs function).

    Mostly this would be the logic for spilling all of the argument
    registers to a location on the stack and similar.



    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 28 01:45:59 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/27/2024 3:37 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/26/2024 1:59 PM, EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    If one had 16-bit displacements, then unscaled displacements would
    make sense; otherwise scaled displacements seem like a win
    (misaligned displacements being much less common than aligned
    displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.
    How does one use a frame pointer without negative displacements ??

    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

    A sign extended 16-bit offsets would cover almost all such access needs
    so I really don't see the need for funny business.

    But if you really want a skewed range offset it could use something like
    excess-256 encoding which zero extends the immediate then subtract 256
    (or whatever) from it, to give offsets in the range -256..+65535-256.
    So an immediate value of 0 equals an offset of -256.


    Yeah, my thinking was that by the time one has 16 bits for Load/Store
    displacements, they could almost just go +/- 32K and call it done.

    But, much smaller than this, there is an advantage to scaling the
    displacements.




    In other news, got around to getting the RISC-V code to build in PIE
    mode for Doom (by using "riscv64-unknown-linux-gnu-*").

    Can note that RV64 code density takes a hit in this case:
       RV64: 299K (.text)
       XG2 : 284K (.text)

    Is this indicative that your ISA and RISC-V are within spitting distance
    of each other in terms of the number of instructions in .text ?? or not ??

    It would appear that, with my current compiler output, both BJX2-XG2 and RISC-V RV64G are within a few percent of each other...

    If adjusting for Jumbo prefixes (with the version that omits GBR reloads):
    XG2: 270K (-10K of Jumbo Prefixes)

    Implying RISC-V now has around 11% more instructions in this scenario.

    Based on Brian's LLVM compiler, RISC-V has about 40% more instructions
    than My 66000; equivalently, My 66000 has about 70% of the number of
    instructions that RISC-V has (same compilation flags, same source code).

    It also has an additional 20K of ".rodata" that is likely constants,
    which likely overlap significantly with the jumbo prefixes.

    My 66000 has vastly smaller .rodata because constants are part of .text

    So, apparently using this version of GCC and using "-fPIE" works in my
    favor regarding code density...


    I guess a question is what FDPIC would do if GCC supported it, since
    this would be the closest direct analog to my own ABI.

    What is FDPIC ?? Federal Deposit Processor Insurance   Corporation ??
                    Final   Dopey   Position  Independent Code ??


    Required a little digging: "Function Descriptor Position Independent Code".

    But, I think the main difference is that normal PIC does calls like:
    LD Rt, [GOT+Disp]
    BSR Rt

    CALX [IP,,#GOT+#disp-.]

    It is unlikely that %GOT can be represented with 16-bit offset from IP
    so the 32-bit displacement form (,,) is used.

    Whereas, FDPIC was typically more like (pseudo ASM):
    MOV SavedGOT, GOT
    LEA Rt, [GOT+Disp]
    MOV GOT, [Rt+8]
    MOV Rt, [Rt+0]
    BSR Rt
    MOV GOT, SavedGOT

    Since GOT is not in a register but is an address constant this is also::

    CALX [IP,,#GOT+#disp-.]

    But, in my case, noting that function calls tend to be more common than
    the functions themselves, and functions will know whether or not they
    need to access global variables or call other functions, ... it made
    more sense to move this logic into the callee.


    No official RISC-V FDPIC ABI that I am aware of, though some proposals
    did seem vaguely similar in some areas to what I was doing with PBO.

    Where, they were accessing globals like:
    LUI Xt, DispHi
    ADD Xt, Xt, DispLo
    ADD Xt, Xt, GP
    LD Xd, Xt, 0

    Granted, this is less efficient than, say:
    MOV.Q (GBR, Disp33s), Rd

    LDD Rd,[IP,,#GOT+#disp-.]

    Though, people didn't really detail the call sequence or prolog/epilog sequences, so less sure how this would work.


    Likely guess, something like:
    MV Xs, GP
    LUI Xt, DispHi
    ADD Xt, Xt, DispLo
    ADD Xt, Xt, GP
    LD GP, Xt, 8
    LD Xt, Xt, 0
    JALR LR, Xt, 0
    MV GP, Xs

    Well, unless they have a better way to pull this off...

    CALX [IP,,#GOT+#disp-.]

    But, yeah, as far as I saw it, my "better solution" was to put this part into the callee.


    Main tradeoff with my design is:
    From any GBR, one needs to be able to get to every other GBR;
    We need to have a way to know which table entry to reload (not
    statically known at compile time).

    Resolved by linker or accessed through GOT in mine. Each dynamic
    module gets its own GOT.

    In my PBO ABI, this was accomplished by using base relocs (but, this is
    N/A for ELF, where PE/COFF style base relocs are not a thing).


    One other option might be to use a PC-relative load to load the index.
    Say:
    AUIPC Xs, DispHi //"__global_pbo_offset$" ?
    LD Xs, DispLo
    LD Xt, GP, 0 //get table of offsets
    ADD Xt, Xt, Xs
    LD GP, Xt, 0

    In this case, "__global_pbo_offset$" would be a magic constant variable
    that gets fixed up by the ELF loader.

    LDD Rd,[IP,,#GOT+#disp-.]

    I guess some people are dragging their feet on FDPIC, as there is some
    debate as to whether or not NOMMU makes sense for RISC-V, along with
    its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable images,
    this would technically eliminate the need for GBR reloading.

    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less space-efficient to
    use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions will
    never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).

    Would you be willing to compile DOOM with Brian's LLVM compiler and
    show the results ??


    Will need to download and build this compiler...

    Might need to look into this.

    Please do.

    But, yeah, current standing for this is:
    XG2 : 280K (static linked, Modified PDPCLIB + TestKern)
    RV64G : 299K (static linked, Modified PDPCLIB + TestKern)
    X86-64: 288K ("gcc -O3", dynamically linked GLIBC)
    X64 : 1083K (VS2022, static linked MSVCRT)

    But, MSVC is an outlier here for just how bad it is on this front.

    To get more reference points, would need to install more compilers.

    Could have provided an ARM reference point, except that the compiler
    isn't compiling stuff at the moment (would need to beat on stuff a bit
    more to try to get it to build; appears to be trying to build with static-linked Newlib but is missing symbols, ...).




    But, yeah, for good comparison, one needs to have everything build with
    the same C library, etc.


    I am thinking it may be possible to save a little more space by folding
    some of the stuff for "va_start()" into an ASM blob (currently, a lot of stuff is folded off into the function prolog, but probably doesn't need
    to be done inline for every varargs function).

    Mostly this would be the logic for spilling all of the argument
    registers to a location on the stack and similar.

    Part of ENTER already does this: A typical subroutine will use::

    ENTER R27,R0,#local_stack_size

    Where the varargs subroutine will use::

    ENTER R27,R8,#local_stack_size
    ADD Rva_ptr,SP,#local_stack_size+64

    notice all we had to do was to specify 8 more registers to be stored;
    and exit with::

    EXIT R27,R0,#local_stack_size+64

    Here we skip over the 8 register variable arguments without reloading
    them.

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 00:56:01 2024
    From Newsgroup: comp.arch

    On 4/27/2024 8:45 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/27/2024 3:37 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/26/2024 1:59 PM, EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    If one had 16-bit displacements, then unscaled displacements
    would make sense; otherwise scaled displacements seem like a win
    (misaligned displacements being much less common than aligned
    displacements).

    What we need is ~16-bit displacements where 82½%-91¼% are positive.
    How does one use a frame pointer without negative displacements ??
    [FP+disp] accesses callee save registers
    [FP-disp] accesses local stack variables and descriptors

    [SP+disp] accesses argument and result values

    A sign extended 16-bit offsets would cover almost all such access needs
    so I really don't see the need for funny business.

    But if you really want a skewed range offset it could use something like
    excess-256 encoding which zero extends the immediate then subtract 256
    (or whatever) from it, to give offsets in the range -256..+65535-256.
    So an immediate value of 0 equals an offset of -256.


    Yeah, my thinking was that by the time one has 16 bits for
    Load/Store displacements, they could almost just go +/- 32K and call
    it done.

    But, much smaller than this, there is an advantage to scaling the
    displacements.




    In other news, got around to getting the RISC-V code to build in PIE
    mode for Doom (by using "riscv64-unknown-linux-gnu-*").

    Can note that RV64 code density takes a hit in this case:
       RV64: 299K (.text)
       XG2 : 284K (.text)

    Is this indicative that your ISA and RISC-V are within spitting
    distance of each other in terms of the number of instructions in
    .text ?? or not ??


    It would appear that, with my current compiler output, both BJX2-XG2
    and RISC-V RV64G are within a few percent of each other...

    If adjusting for Jumbo prefixes (with the version that omits GBR
    reloads):
       XG2: 270K (-10K of Jumbo Prefixes)

    Implying RISC-V now has around 11% more instructions in this scenario.

    Based on Brian's LLVM compiler; RISC-V has about 40% more instructions
    than My 66000, or My 66000 has 70% the number of instructions that
    RISC-V has (same compilation flags, same source code).


    I have made some progress here recently, but it is still a case of (in
    my case):
    Stronger ISA, but with a compiler with a weak optimizer;
    Vs:
    Weaker ISA, but vs a compiler with a stronger optimizer.


    GCC is very clever at figuring out what to optimize...

    Meanwhile, BGBCC may fail to optimize away constant sub-expressions if operator precedence doesn't fall in a preferable direction.

    Say:
    y=x*3*4;
    Doing two multiply instructions in a row, because:
    y=x*12;
    Didn't happen to map to the AST as it was written (because parsing was left-associative in this case).

    Yeah, actually ran into this recently; the only solution at present is
    to put parentheses around the constant parts.


    But, yeah, GCC seemingly isn't fooled by things like precedence order.
    It may even chase constants across basic blocks or across memory loads
    and stores, causing chunks of code to disappear, etc...

    But, still not enough to make up for RV64G's weaknesses, it seems.
    Well, and Doom isn't full of a lot of cases that let it leverage its
    seemingly aggressive constant-folding might...



    It also has an additional 20K of ".rodata" that is likely constants,
    which likely overlap significantly with the jumbo prefixes.

    My 66000 has vastly smaller .rodata because constants are part of .text


    Similar, though in my case they exist as Jumbo prefixes.


    Except, well, when values are declared as "const double x0=...;",
    BGBCC ends up treating them like normal variables that do not allow
    assignment (so it will generate different code than if one had used
    #define or similar).
    Also noted cases of this recently when diffing through my compiler output.

    Does seem to be context-dependent to some extent though...


    So, apparently using this version of GCC and using "-fPIE" works in
    my favor regarding code density...


    I guess a question is what FDPIC would do if GCC supported it, since
    this would be the closest direct analog to my own ABI.

    What is FDPIC ?? Federal Deposit Processor Insurance   Corporation ??
                     Final   Dopey   Position  Independent Code ??


    Required a little digging: "Function Descriptor Position Independent
    Code".

    But, I think the main difference is that normal PIC does calls like:
       LD Rt, [GOT+Disp]
       BSR Rt

        CALX   [IP,,#GOT+#disp-.]

    It is unlikely that %GOT can be represented with 16-bit offset from IP
    so the 32-bit displacement form (,,) is used.

    Whereas, FDPIC was typically more like (pseudo ASM):
       MOV SavedGOT, GOT
       LEA Rt, [GOT+Disp]
       MOV GOT, [Rt+8]
       MOV Rt, [Rt+0]
       BSR Rt
       MOV GOT, SavedGOT

    Since GOT is not in a register but is an address constant this is also::

        CALX   [IP,,#GOT+#disp-.]


    So... Would this also cause GOT to point to a new address on the callee
    side (that is dependent on the GOT on the caller side, and *not* on the
    PC address at the destination) ?...

    In effect, the context dependent GOT daisy-chaining is a fundamental
    aspect of FDPIC that is different from conventional PIC.


    But, in my case, noting that function calls tend to be more common
    than the functions themselves, and functions will know whether or not
    they need to access global variables or call other functions, ... it
    made more sense to move this logic into the callee.


    No official RISC-V FDPIC ABI that I am aware of, though some proposals
    did seem vaguely similar in some areas to what I was doing with PBO.

    Where, they were accessing globals like:
       LUI Xt, DispHi
       ADD Xt, Xt, DispLo
       ADD Xt, Xt, GP
       LD  Xd, Xt, 0

    Granted, this is less efficient than, say:
       MOV.Q (GBR, Disp33s), Rd

        LDD   Rd,[IP,,#GOT+#disp-.]


    As noted, BJX2 can handle this in a single 64-bit instruction, vs 4 instructions.


    Though, people didn't really detail the call sequence or prolog/epilog
    sequences, so less sure how this would work.


    Likely guess, something like:
       MV    Xs, GP
       LUI   Xt, DispHi
       ADD   Xt, Xt, DispLo
       ADD   Xt, Xt, GP
       LD    GP, Xt, 8
       LD    Xt, Xt, 0
       JALR  LR, Xt, 0
       MV    GP, Xs

    Well, unless they have a better way to pull this off...

        CALX   [IP,,#GOT+#disp-.]


    Well, can you explain the semantics of this one...


    But, yeah, as far as I saw it, my "better solution" was to put this
    part into the callee.


    Main tradeoff with my design is:
       From any GBR, one needs to be able to get to every other GBR;
       We need to have a way to know which table entry to reload (not
    statically known at compile time).

    Resolved by linker or accessed through GOT in mine. Each dynamic
    module gets its own GOT.


    The important thing is not associating a GOT with an ELF module, but
    with an instance of said module.

    So, say, one copy of an ELF image, can have N separate GOTs and data
    sections (each associated with a program instance).

    In my PBO ABI, this was accomplished by using base relocs (but, this
    is N/A for ELF, where PE/COFF style base relocs are not a thing).


    One other option might be to use a PC-relative load to load the index.
    Say:
       AUIPC Xs, DispHi  //"__global_pbo_offset$" ?
       LD Xs, DispLo
       LD Xt, GP, 0   //get table of offsets
       ADD Xt, Xt, Xs
       LD  GP, Xt, 0

    In this case, "__global_pbo_offset$" would be a magic constant
    variable that gets fixed up by the ELF loader.

        LDD   Rd,[IP,,#GOT+#disp-.]


    Still going to need to explain the semantics here...

    Based on previous examples, the above would presumably be a normal
    variable load.

    This was not the purpose of the "__global_pbo_offset$" trick; rather,
    it was about how to perform the GP reload in RV64 in a way that does
    not require base relocs (and is compatible with the ELF way of doing
    things).


    I guess some people are dragging their feet on FDPIC, as there is
    some debate as to whether or not NOMMU makes sense for RISC-V, along
    with its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable
    images, this would technically eliminate the need for GBR reloading.

    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less space-efficient
    to use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions will
    never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).

    Would you be willing to compile DOOM with Brian's LLVM compiler and
    show the results ??


    Will need to download and build this compiler...

    Might need to look into this.

    Please do.


    Extracting the ZIP file and "git clone llvm-project" etc, have thus far
    taken hours...

    Well, and then the commands to CMake were not working; tried invoking
    cmake more minimally, and it gave a message complaining about the
    version being too old, ...

    Seems I have to build it with a different / newer WSL instance (well, I
    guess it was either this or try to rebuild CMake from source).


    Checks, download for compiler (+ git cloned LLVM) is a little over 6GB.


    Well, OK, now LLVM is building... I guess, will see if it compiles and
    doesn't explode in the process. Probably going to be a while it seems.



    But, yeah, current standing for this is:
       XG2   :  280K (static linked, Modified PDPCLIB + TestKern)
       RV64G :  299K (static linked, Modified PDPCLIB + TestKern)
       X86-64:  288K ("gcc -O3", dynamically linked GLIBC)
       X64   : 1083K (VS2022, static linked MSVCRT)

    But, MSVC is an outlier here for just how bad it is on this front.

    To get more reference points, would need to install more compilers.

    Could have provided an ARM reference point, except that the compiler
    isn't compiling stuff at the moment (would need to beat on stuff a bit
    more to try to get it to build; appears to be trying to build with
    static-linked Newlib but is missing symbols, ...).




    But, yeah, for good comparison, one needs to have everything build
    with the same C library, etc.


    I am thinking it may be possible to save a little more space by
    folding some of the stuff for "va_start()" into an ASM blob
    (currently, a lot of stuff is folded off into the function prolog, but
    probably doesn't need to be done inline for every varargs function).

    Mostly this would be the logic for spilling all of the argument
    registers to a location on the stack and similar.

    Part of ENTER already does this: A typical subroutine will use::

        ENTER    R27,R0,#local_stack_size

    Where the varargs subroutine will use::

        ENTER    R27,R8,#local_stack_size
        ADD      Rva_ptr,SP,#local_stack_size+64

    Notice all we had to do was specify 8 more registers to be stored;
    and exit with::

        EXIT     R27,R0,#local_stack_size+64

    Here we skip over the 8 register variable arguments without reloading
    them.


    It is mostly a chunk of code for storing the argument registers to
    memory, either 8 or 16 depending on the ABI variant. Need to save them
    off to memory mostly so "va_arg()" can see them.

    Previously, this part has been done inline, but is a fairly repetitive
    code sequence...


    Though, folding it off doesn't really seem to have saved all that much...


    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 02:11:45 2024
    From Newsgroup: comp.arch

    On 4/28/2024 12:56 AM, BGB wrote:
    On 4/27/2024 8:45 PM, MitchAlsup1 wrote:
    BGB wrote:


    ...


    I guess some people are dragging their feet on FDPIC, as there is
    some debate as to whether or not NOMMU makes sense for RISC-V,
    along with its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable
    images, this would technically eliminate the need for GBR reloading.

    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less space-efficient
    to use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions
    will never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).

    Would you be willing to compile DOOM with Brian's LLVM compiler and
    show the results ??


    Will need to download and build this compiler...

    Might need to look into this.

    Please do.


    Extracting the ZIP file and "git clone llvm-project" etc, have thus far taken hours...

    Well, and then the commands to CMake were not working, tried invoking
    cmake more minimally, and it gives a message complaining about the
    version being too old, ...

    Seems I have to build it with a different / newer WSL instance (well, I guess it was either this or try to rebuild CMake from source).


    Checks, download for compiler (+ git cloned LLVM) is a little over 6GB.


    Well, OK, now LLVM is building... I guess, will see if it compiles and doesn't explode in the process. Probably going to be a while it seems.




    A little over an hour later and it still hasn't broken 50% yet...

    I think LLVM rebuilds may have actually gotten slower than in the past...


    Well, at least my 112GB of RAM means it isn't swapping too much...

    Computer is a little sluggish and the "System" process seems kinda
    pegged out though...



    I guess I will know sometime later whether or not all of this builds...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 03:59:28 2024
    From Newsgroup: comp.arch

    On 4/28/2024 2:11 AM, BGB wrote:
    On 4/28/2024 12:56 AM, BGB wrote:
    On 4/27/2024 8:45 PM, MitchAlsup1 wrote:
    BGB wrote:


    ...


    I guess some people are dragging their feet on FDPIC, as there is
    some debate as to whether or not NOMMU makes sense for RISC-V,
    along with its associated performance impact if used.

    In my case, if I wanted to go over to simple base-relocatable
    images, this would technically eliminate the need for GBR reloading.
    Checks:
    Simple base-relocatable case actually currently generates bigger
    binaries, I suspect because in this case it is less
    space-efficient to use PC-rel vs GBR-rel.

    Went and added a "pbostatic" option, which sidesteps saving and
    restoring GBR (making the simplifying assumption that functions
    will never be called from outside the current binary).

    This saves roughly 4K (Doom's ".text" shrinks to 280K).

    Would you be willing to compile DOOM with Brian's LLVM compiler and
    show the results ??


    Will need to download and build this compiler...

    Might need to look into this.

    Please do.


    Extracting the ZIP file and "git clone llvm-project" etc, have thus
    far taken hours...

    Well, and then the commands to CMake were not working, tried invoking
    cmake more minimally, and it gives a message complaining about the
    version being too old, ...

    Seems I have to build it with a different / newer WSL instance (well,
    I guess it was either this or try to rebuild CMake from source).


    Checks, download for compiler (+ git cloned LLVM) is a little over 6GB.


    Well, OK, now LLVM is building... I guess, will see if it compiles and
    doesn't explode in the process. Probably going to be a while it seems.




    A little over an hour later and it still hasn't broken 50% yet...

    I think LLVM rebuilds may have actually gotten slower than in the past...


    Well, at least my 112GB of RAM means it isn't swapping too much...

    Computer is a little sluggish and the "System" process seems kinda
    pegged out though...



    I guess I will know sometime later whether or not all of this builds...



    Still watching LLVM build (several hours later), kind of an interesting
    meta aspect in its behaviors.

    Sub-stage 1:
    cc1plus processes start out mostly idle, sit around for a few seconds,
    then CPU activity spikes and they terminate (with a new one spawning to
    take each one's place).
    During this sub-stage, the PC is generally sluggish.

    Sub-stage 2:
    "llvm-tblgen" runs; these processes are short-lived and run at high CPU
    load, and the PC stops being sluggish for the brief moments it is running
    these (but, soon enough, it is back to the former stage).


    Overall CPU load is fairly modest, and HDD activity is also fairly
    reasonable. Seems like there is a bottleneck that "cc1plus" steps on.

    The "System" process (description "NT Kernel & System") runs at a
    somewhat higher than usual CPU load (~8%), where spikes in this process
    seem correlated with the "general sluggishness".


    Seems likely related to the crapton of files it contained...
    Also, it does not escape my notice that on the build drive in question, it
    has seemingly eaten over 100GB of disk space during the build process...


    Like, seemingly, LLVM has managed to somehow become more absurd than it
    was in the past...

    Dunno, maybe all this seems like pretty reasonable project design to
    some people, but at least to me, it all seems a little absurd.
    Well, unless some people experience it building quickly and not eating
    huge amounts of HDD space.

    Dunno, maybe it is related to me running the build on a drive with
    "Folder Compression" enabled?... (Oftentimes, folder compression makes
    things faster, but this might be one of the times where it doesn't).


    Some of this is kind of a disincentive to trying to build a compiler
    based on LLVM though. Trying to have this as a normal part of the build process seems implausible.

    Like, if it makes Vivado synthesis and implementation seem fast and lightweight in comparison, this is not an ideal situation...

    GCC is an ugly mess, but at least it builds moderately faster.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 28 09:05:26 2024
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Still watching LLVM build (several hours later), kind of an interesting meta aspect in its behaviors.

    Don't build it in debug mode.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 04:43:56 2024
    From Newsgroup: comp.arch

    On 4/28/2024 4:05 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    Still watching LLVM build (several hours later), kind of an interesting
    meta aspect in its behaviors.

    Don't build it in debug mode.

    I was building it in MinSizeRel mode...


    But, yeah, need to go to sleep... May poke with it tomorrow if all goes well...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 14:00:28 2024
    From Newsgroup: comp.arch

    On 4/28/2024 4:43 AM, BGB wrote:
    On 4/28/2024 4:05 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    Still watching LLVM build (several hours later), kind of an interesting meta aspect in its behaviors.

    Don't build it in debug mode.

    I was building it in MinSizeRel mode...


    Also "-j 4".

    Didn't want to go too much higher as this would likely bog down PC harder.

    Some sources say builds should not take quite this long, but this is what
    I am seeing...



    But, yeah, need to go to sleep... May poke with it tomorrow if all goes well...


    Seems the options I had used did not enable "clang".

    Had to enable clang and run again, though as LLVM itself was already
    built, it seemed to not need to do much with the parts already built.

    So, rerun with "clang" enabled took ~ 1hr.



    So, errm... Once built, how does one actually get it to target the ISA?

    "--target my66000-none-elf" or similar just gets it to complain about an unknown triple, not sure how to query for known targets/triples with clang.


    The built "llc --version" makes no mention of MY66000...
    Seems to self-identify as version 16.0.0git, host CPU znver1.


    Had built the version found here:
    https://github.com/bagel99/llvm-my66000


    Ironically, it does seem to know about ARM64, though trying to build
    anything with it results in it complaining about missing headers ("bits/libc-header-start.h").


    No obvious documentation besides that from LLVM itself, and not super
    easy to figure out otherwise. Was there some option I was supposed to
    give to CMake to enable the target?...


    Granted, there is always a possibility that I screwed something up here
    (or possible interference from another LLVM version?...).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 28 19:24:51 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/27/2024 8:45 PM, MitchAlsup1 wrote:


    But, I think the main difference is that, normal PIC does calls like
    like:
       LD Rt, [GOT+Disp]
       BSR Rt

        CALX   [IP,,#GOT+#disp-.]

    It is unlikely that %GOT can be represented with 16-bit offset from IP
    so the 32-bit displacement form (,,) is used.

    Whereas, FDPIC was typically more like (pseudo ASM):
       MOV SavedGOT, GOT
       LEA Rt, [GOT+Disp]
       MOV GOT, [Rt+8]
       MOV Rt, [Rt+0]
       BSR Rt
       MOV GOT, SavedGOT

    Since GOT is not in a register but is an address constant this is also::

        CALX   [IP,,#GOT+#disp-.]


    So... Would this also cause GOT to point to a new address on the callee
    side (that is dependent on the GOT on the caller side, and *not* on the
    PC address at the destination) ?...

    The module on the calling side has its GOT, and the module on the called
    side has its own GOT, where offsets to/in the GOT are determined by the
    linker making the module. There may be cases where multiple link edits on
    a final module leave some of the functions in this module accessed via
    the GOT in this module, and in these cases one uses

    CALA [IP,,#GOT+#disp-.] // LDD ip changes to LDA ip

    In effect, the context dependent GOT daisy-chaining is a fundamental
    aspect of FDPIC that is different from conventional PIC.

    Yes, understood, and it happens.

    But, in my case, noting that function calls tend to be more common
    than the functions themselves, and functions will know whether or not
    they need to access global variables or call other functions, ... it
    made more sense to move this logic into the callee.


    No official RISC-V FDPIC ABI that I am aware of, though some proposals
    did seem vaguely similar in some areas to what I was doing with PBO.

    Where, they were accessing globals like:
       LUI Xt, DispHi
       ADD Xt, Xt, DispLo
       ADD Xt, Xt, GP
       LD  Xd, Xt, 0

    Granted, this is less efficient than, say:
       MOV.Q (GBR, Disp33s), Rd

        LDD   Rd,[IP,,#GOT+#disp-.]


    As noted, BJX2 can handle this in a single 64-bit instruction, vs 4 instructions.


    Though, people didn't really detail the call sequence or prolog/epilog
    sequences, so less sure how this would work.


    Likely guess, something like:
       MV    Xs, GP
       LUI   Xt, DispHi
       ADD   Xt, Xt, DispLo
       ADD   Xt, Xt, GP
       LD    GP, Xt, 8
       LD    Xt, Xt, 0
       JALR  LR, Xt, 0
       MV    GP, Xs

    Well, unless they have a better way to pull this off...

        CALX   [IP,,#GOT+#disp-.]


    Well, can you explain the semantics of this one...


    But, yeah, as far as I saw it, my "better solution" was to put this
    part into the callee.


    Main tradeoff with my design is:
       From any GBR, one needs to be able to get to every other GBR;
       We need to have a way to know which table entry to reload (not
    statically known at compile time).

    Resolved by linker or accessed through GOT in mine. Each dynamic
    module gets its own GOT.


    The important thing is not associating a GOT with an ELF module, but
    with an instance of said module.

    Yes.

    So, say, one copy of an ELF image can have N separate GOTs and data sections (each associated with a program instance).

    In my PBO ABI, this was accomplished by using base relocs (but, this
    is N/A for ELF, where PE/COFF style base relocs are not a thing).


    One other option might be to use a PC-relative load to load the index.
    Say:
       AUIPC Xs, DispHi  //"__global_pbo_offset$" ?
       LD Xs, DispLo
       LD Xt, GP, 0   //get table of offsets
       ADD Xt, Xt, Xs
       LD  GP, Xt, 0

    In this case, "__global_pbo_offset$" would be a magic constant
    variable that gets fixed up by the ELF loader.

        LDD   Rd,[IP,,#GOT+#disp-.]


    Still going to need to explain the semantics here...

    IP+&GOT+disp-IP is a 64-bit pointer into GOT where the external linkage
    pointer resides.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 28 20:11:56 2024
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    "--target my66000-none-elf" or similar just gets it to complain about an unknown triple, not sure how to query for known targets/triples with clang.

    Grepping around the CMakeCache.txt file in my build directory, I find

    //Semicolon-separated list of experimental targets to build.
    LLVM_EXPERIMENTAL_TARGETS_TO_BUILD:STRING=My66000

    This is documented in llvm/lib/Target/My66000/README .
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 15:20:00 2024
    From Newsgroup: comp.arch

    On 4/28/2024 3:11 PM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    "--target my66000-none-elf" or similar just gets it to complain about an
    unknown triple, not sure how to query for known targets/triples with clang.

    Grepping around the CMakeCache.txt file in my build directory, I find

    //Semicolon-separated list of experimental targets to build.
    LLVM_EXPERIMENTAL_TARGETS_TO_BUILD:STRING=My66000

    This is documented in llvm/lib/Target/My66000/README .

    I realized after posting this that I had cloned the wrong branch...

    I had cloned "main", seems the My66000 stuff was in the "my66000"
    branch; somehow didn't realize that there would be multiple branches
    with different stuff in each branch.


    But, now is the process of waiting for this branch to build...

    Probably a few more hours, then I will see what happens.

    Build status currently at ~20%.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 16:16:07 2024
    From Newsgroup: comp.arch

    On 4/28/2024 2:24 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/27/2024 8:45 PM, MitchAlsup1 wrote:


    But, I think the main difference is that, normal PIC does calls like
    like:
       LD Rt, [GOT+Disp]
       BSR Rt

         CALX   [IP,,#GOT+#disp-.]

    It is unlikely that %GOT can be represented with 16-bit offset from IP
    so the 32-bit displacement form (,,) is used.

    Whereas, FDPIC was typically more like (pseudo ASM):
       MOV SavedGOT, GOT
       LEA Rt, [GOT+Disp]
       MOV GOT, [Rt+8]
       MOV Rt, [Rt+0]
       BSR Rt
       MOV GOT, SavedGOT

    Since GOT is not in a register but is an address constant this is also::
         CALX   [IP,,#GOT+#disp-.]


    So... Would this also cause GOT to point to a new address on the
    callee side (that is dependent on the GOT on the caller side, and
    *not* on the PC address at the destination) ?...

    The module on the calling side has its GOT, and the module on the called
    side has its own GOT, where offsets to/in the GOT are determined by the
    linker making the module. There may be cases where multiple link edits on
    a final module leave some of the functions in this module accessed via
    the GOT in this module, and in these cases one uses
        CALA   [IP,,#GOT+#disp-.]     // LDD ip changes to LDA ip


    OK, but it seems I may be failing to understand something here...


    In effect, the context dependent GOT daisy-chaining is a fundamental
    aspect of FDPIC that is different from conventional PIC.

    Yes, understood, and it happens.

    But, in my case, noting that function calls tend to be more common
    than the functions themselves, and functions will know whether or
    not they need to access global variables or call other functions,
    ... it made more sense to move this logic into the callee.


    No official RISC-V FDPIC ABI that I am aware of, though some
    proposals did seem vaguely similar in some areas to what I was doing
    with PBO.

    Where, they were accessing globals like:
       LUI Xt, DispHi
       ADD Xt, Xt, DispLo
       ADD Xt, Xt, GP
       LD  Xd, Xt, 0

    Granted, this is less efficient than, say:
       MOV.Q (GBR, Disp33s), Rd

         LDD   Rd,[IP,,#GOT+#disp-.]


    As noted, BJX2 can handle this in a single 64-bit instruction, vs 4
    instructions.


    Though, people didn't really detail the call sequence or
    prolog/epilog sequences, so less sure how this would work.


    Likely guess, something like:
       MV    Xs, GP
       LUI   Xt, DispHi
       ADD   Xt, Xt, DispLo
       ADD   Xt, Xt, GP
       LD    GP, Xt, 8
       LD    Xt, Xt, 0
       JALR  LR, Xt, 0
       MV    GP, Xs

    Well, unless they have a better way to pull this off...

         CALX   [IP,,#GOT+#disp-.]


    Well, can you explain the semantics of this one...


    But, yeah, as far as I saw it, my "better solution" was to put this
    part into the callee.


    Main tradeoff with my design is:
       From any GBR, one needs to be able to get to every other GBR;
       We need to have a way to know which table entry to reload (not
    statically known at compile time).

    Resolved by linker or accessed through GOT in mine. Each dynamic
    module gets its own GOT.


    The important thing is not associating a GOT with an ELF module, but
    with an instance of said module.

    Yes.

    So, say, one copy of an ELF image, can have N separate GOTs and data
    sections (each associated with a program instance).

    In my PBO ABI, this was accomplished by using base relocs (but, this
    is N/A for ELF, where PE/COFF style base relocs are not a thing).


    One other option might be to use a PC-relative load to load the index.
    Say:
       AUIPC Xs, DispHi  //"__global_pbo_offset$" ?
       LD Xs, DispLo
       LD Xt, GP, 0   //get table of offsets
       ADD Xt, Xt, Xs
       LD  GP, Xt, 0

    In this case, "__global_pbo_offset$" would be a magic constant
    variable that gets fixed up by the ELF loader.

         LDD   Rd,[IP,,#GOT+#disp-.]


    Still going to need to explain the semantics here...

    IP+&GOT+disp-IP is a 64-bit pointer into GOT where the external linkage pointer resides.

    OK.

    Not sure I follow here what exactly is going on...


    As noted, if I did a similar thing to the RISC-V example, but with my
    own ISA (with the MOV.C extension):
        MOV.Q (PC, Disp33), R0
        MOV.Q (GBR, 0), R18
        MOV.C (R18, R0), GBR

    Differing mostly in that it doesn't require base relocs.

    The normal version in my case avoids the extra memory load, but uses a
    base reloc for the table index.

    ...



    Though, the reloc format is at least semi-dense, e.g., for a block of relocs:
      { DWORD rvaPage;   // address of page (4K)
        DWORD szRelocs;  // size of relocs in block
      }
    With each reloc encoded as a 16-bit entry:
    (15:12): Reloc Type
    (11: 0): Address within Page (4K)

    One downside is this format is less efficient for sparse relocs (current situation), where often there are only 1 or 2 relocs per page (typically
    the PBO index fixups and similar).


    One situation could be to have a modified format that partially omits
    the block structuring, say:
      0ddd: Advance current page position by ddd pages (4K);
        0000: Effectively a NOP (as before)
      1ddd..Cddd: Apply the given reloc.
        These represent typical relocs, target dependent:
        HI16, LO16, DIR32, HI32ADJ, ...
        8ddd: Was assigned for PBO fixups;
        Addd: Fixup for a 64-bit address, also semi-common.
      Dzzz/Ezzz: Extended Relocs
        These ones are configurable from a larger set of reloc types.
      Fzzz: Command-Escape
    ...

    Where, say, rather than needing 1 block per 4K page, it is 1 block per
    PE section.


    Though, base relocs are a relatively small part of the size of the binary.


    To some extent, the PBO reloc is magic in that it works by
    pattern-matching the instruction that it finds at the given address. So,
    in effect, it is only defined for a limited range of instructions.

    Contrast with, say, the 1/2/3/4/A relocs, which expect raw 16/32/64 bit values. Though, a lot of these are not currently used for BJX2 (does not
    use 16-bit addressing modes, ...).

    Here:
    5/6/7/8/9/B/C, ended up used for BJX2 relocs in BJX2 mode.
    For other targets, they would have other meanings.
    D/E/F were reserved as expanded/escape-case relocs, in case I need to
    add more. These would differ partly in that the reloc sub-type would be assigned as a sort of state-machine.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Sun Apr 28 15:37:17 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:

    John Savard wrote:

    On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    BGB wrote:

    Sign-extend signed values, zero-extend unsigned values.

    Another mistake I made in Mc 88100.

    As that is a mistake the IBM 360 made, I make it too. But I make it
    the way the 360 did: there are no signed and unsigned values, in the
    sense of a Burroughs machine, there are just Load, Load Unsigned - and
    Insert - instructions.

    Index and base register values are assumed to be unsigned.

    I would use the term signless as opposed to unsigned.

    What's the point of using a non-standard term when there is a
    common and firmly established standard term? I don't see how the
    non-standard term conveys anything different. Next thing you
    know someone will want to say "signful" rather than "signed".

    Address arithmetic is ADD only and does not care about signs or
    overflow. There is no concept of a negative base register or a
    negative index register (or, for that matter, a negative
    displacement), overflow, underflow, carry, ...

    Some people here have argued that (for some architectures), addresses
    with the high-order bit set should be taken as negative rather than
    positive. Or did you mean your comment to apply only to certain
    architectures (IBM 360, Mc 88100, perhaps others?), and not to
    all architectures?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 28 22:56:41 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/28/2024 2:24 PM, MitchAlsup1 wrote:

    Still going to need to explain the semantics here...

    IP+&GOT+disp-IP is a 64-bit pointer into GOT where the external linkage
    pointer resides.

    OK.

    Not sure I follow here what exactly is going on...

    While I am sure I don't understand what is going on....

    As noted, if I did a similar thing to the RISC-V example, but with my
    own ISA (with the MOV.C extension):
        MOV.Q (PC, Disp33), R0    // What data does this access ?
        MOV.Q (GBR, 0), R18
        MOV.C (R18, R0), GBR

    It appears to me that you are placing an array of GOT pointers at
    the first entry of any particular GOT ?!?

    Whereas My 66000 uses IP relative access to the GOT the linker
    (or LD.so) setup avoiding the indirection.

    Then My 66000 does not have or need a pointer to GOT since it can
    synthesize such a pointer at link time and then just use a IP relative
    plus DISP32 to access said GOT.

    So, say we have some external variables::

    extern uint64_t fred, wilma, barney, betty;

    AND we postulate that the linker found all 4 externs in the same module
    so that it can access them all via 1 pointer. The linker assigns an
    index into GOT and sets up a relocation to that memory segment, and when
    LD.so runs, it stores a proper pointer in that index of GOT, call this
    index fred_index.

    And we access one of these::

    if( fred at_work )

    The compiler will obtain the pointer to the area fred is positioned via:

        LDD    Rfp,[IP,,#GOT+fred_index<<3]    // *

    and from here one can access barney, betty and wilma using the pointer
    to fred and standard offsetting.

        LDD    Rfred,[Rfp,#0]     // fred
        LDD    Rbarn,[Rfp,#16]    // barney
        LDD    Rbett,[Rfp,#24]    // betty
        LDD    Rwilm,[Rfp,#8]     // wilma

    These offsets are known at link time and possibly not at compile time.

    (*) if the LDD through GOT takes a page fault, we have a procedure set up
    so LD.so can run, figure out which entry is missing, look up where it is
    (possibly loading and resolving it), and insert the required data into GOT.
    When control returns to LDD, the entry is now present, and we now have
    access to fred, wilma, barney and betty.

    Differing mostly in that it doesn't require base relocs.

    The normal version in my case avoids the extra memory load, but uses a
    base reloc for the table index.

    ....

    {{ // this looks like stuff that should be accessible to LD.so

    Though, the reloc format is at least semi-dense, eg, for a block of relocs:
    { DWORD rvaPage; //address of page (4K)
    DWORD szRelocs; //size of relocs in block
    }
    With each reloc encoded as a 16-bit entry:
    (15:12): Reloc Type
    (11: 0): Address within Page (4K)

    One downside is this format is less efficient for sparse relocs (current situation), where often there are only 1 or 2 relocs per page (typically
    the PBO index fixups and similar).


    One situation could be to have a modified format that partially omits
    the block structuring, say:
    0ddd: Advance current page position by ddd pages (4K);
    0000: Effectively a NOP (as before)
    1ddd..Cddd: Apply the given reloc.
    These represent typical relocs, target dependent.
    HI16, LO16, DIR32, HI32ADJ, ...
    8ddd: Was assigned for PBO fixups;
    Addd: Fixup for a 64-bit address, also semi common.
    Dzzz/Ezzz: Extended Relocs
    These ones are configurable from a larger set of reloc types.
    Fzzz: Command-Escape
    ...

    Where, say, rather than needing 1 block per 4K page, it is 1 block per
    PE section.


    Though, base relocs are a relatively small part of the size of the binary.


    To some extent, the PBO reloc is magic in that it works by
    pattern-matching the instruction that it finds at the given address. So,
    in effect, is only defined for a limited range of instructions.

    Contrast with, say, the 1/2/3/4/A relocs, which expect raw 16/32/64 bit values. Though, a lot of these are not currently used for BJX2 (does not
    use 16-bit addressing nides, ...).

    Here:
    5/6/7/8/9/B/C, ended up used for BJX2 relocs in BJX2 mode.
    For other targets, they would have other meanings.
    D/E/F were reserved as expanded/escape-case relocs, in case I need to
    add more. These would differ partly in that the reloc sub-type would be assigned as a sort of state-machine.


    but not the program itself}}
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 28 23:45:45 2024
    From Newsgroup: comp.arch

    On 4/28/2024 5:56 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/28/2024 2:24 PM, MitchAlsup1 wrote:

    Still going to need to explain the semantics here...

    IP+&GOT+disp-IP is a 64-bit pointer into GOT where the external linkage
    pointer resides.

    OK.

    Not sure I follow here what exactly is going on...

    While I am sure I don't understand what is going on....

    As noted, if I did a similar thing to the RISC-V example, but with my
    own ISA (with the MOV.C extension):
         MOV.Q (PC, Disp33), R0            // What data does this access ?
         MOV.Q (GBR, 0), R18
         MOV.C (R18, R0), GBR

    It appears to me that you are placing an array of GOT pointers at the
    first entry of any particular GOT ?!?



    They are not really GOTs in my PBO ABI, but rather the start of every
    ".data" section.

    In this case, every ".data" section starts with a pointer to an array
    that holds a pointer to every other ".data" section in the process
    image, and every DLL is assigned an index in this array (except the main
    EXE, which always has an index value of 0).

    Every program instance exists relative to this array of ".data" sections.


    So, say, "Process 1" will have one version of this array, "Process 2"
    will have another, etc.

    And, all of the data sections in Process 1 will point to the array for
    Process 1. And, all of the data sections in Process 2 will point to the
    array for Process 2. And so on...


    So, even if all the ".text" sections are shared between "Process 1" and "Process 2" (with both existing within the same address space), because
    the data sections are separate; each has its own set of global
    variables, so the processes effectively don't see the other versions of
    the program running within the shared address space.



    In some sense, FDPIC is vaguely similar, but does use GOTs, effectively
    daisy-chaining all the GOTs together (having a GOT pointer for every
    function pointer in the GOT).

    But, as can be noted, this does add some overhead.

    In my case, I had wanted to do something similar to FDPIC in the sense
    of allowing multiple instances without needing to duplicate the
    read-only sections. But, I also wanted a lower performance overhead.



    Whereas My 66000 uses IP relative access to the GOT the linker (or
    LD.so) setup avoiding the indirection.
    Then My 66000 does not have or need a pointer to GOT since it can
    synthesize such a pointer at link time and then just use a IP relative
    plus DISP32 to access said GOT.



    This approach works so long as one has a one-to-one mapping between
    loaded binaries, and their associated sets of global variables (or, if
    each mapping exists in its own address space).

    Doesn't work so well for a many-to-one mapping within a shared address
    space.


    So, say, if you only have one instance of a binary, getting the GOT or
    data sections relative to PC/IP can work.

    But, with multiple instances, it does not work. The data sections can
    only be relative to the other data sections (or to the process context).



    Like, say, if you wanted to support a multitasking operating system on
    hardware that doesn't have either virtual memory or segments.
    Or, if one does have virtual memory, but wants to keep it as optional.

    Say, for example, uClinux...



    So, say we have some external variables::

        extern uint64_t fred, wilma, barney, betty;

    AND we postulate that the linker found all 4 externs in the same module
    so that it can access them all via 1 pointer. The linker assigns an
    index into the GOT and sets up a relocation to that memory segment;
    when LD.so runs, it stores a proper pointer at that index of the GOT.
    Call this index fred_index.

    And we access one of these::

        if( fred at_work )

    The compiler will obtain the pointer to the area fred is positioned via:

        LDD    Rfp,[IP,,#GOT+fred_index<<3]        // *


    And, the above is where the problem lies...

    Would be valid for ELF PIC or PIE binaries, but is not valid for PBO or
    FDPIC.



    and from here one can access barney, betty and wilma using the pointer
    to fred and standard offsetting.

        LDD    Rfred,[Rfp,#0]     // fred
        LDD    Rbarn,[Rfp,#16]    // barney
        LDD    Rbett,[Rfp,#24]    // betty
        LDD    Rwilm,[Rfp,#8]     // wilma

    These offsets are known at link time and possibly not at compile time.
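    In C terms, the sequence above boils down to one pointer load plus
    link-time-constant offsets. A rough simulation (the got array and
    FRED_INDEX here are stand-ins for the real GOT and slot number):

    ```c
    #include <stdint.h>

    /* Stand-in for the real GOT: an array of pointers into data areas. */
    static uint64_t *got[8];
    enum { FRED_INDEX = 3 };   /* hypothetical slot assigned by the linker */

    static uint64_t load_fred(void)
    {
        /* LDD Rfp,[IP,,#GOT+fred_index<<3] -- one load to get the base */
        uint64_t *fp = got[FRED_INDEX];
        /* fred at +0; wilma would be fp[1], barney fp[2], betty fp[3] */
        return fp[0];
    }
    ```

    The compiler emits the slot index; only the linker knows the final
    0/8/16/24 layout, which is why the offsets are link-time constants.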

    (*) if the LDD through the GOT takes a page fault, we have a procedure
    set up so LD.so can run, figure out which entry is missing, look up
    where it is (possibly loading and resolving it), and insert the
    required data into the GOT. When control returns to the LDD, the entry
    is now present, and we now have access to fred, wilma, barney, and
    betty.


    Yeah.


    Differing mostly in that it doesn't require base relocs.

    The normal version in my case avoids the extra memory load, but uses a
    base reloc for the table index.

    ....

    {{ // this looks like stuff that should be accessible to LD.so

    Though, the reloc format is at least semi-dense, eg, for a block of
    relocs:
       { DWORD rvaPage;   //address of page (4K)
         DWORD szRelocs;  //size of relocs in block
       }
    With each reloc encoded as a 16-bit entry:
       (15:12): Reloc Type
       (11: 0): Address within Page (4K)
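    A rough C sketch of walking this block format (assuming, as in
    PE/COFF, that the block size field counts the 8-byte header too, and
    that type 0 is padding; apply is a hypothetical callback):

    ```c
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t rvaPage;   /* address of page (4K) */
        uint32_t szRelocs;  /* size of block in bytes, header included */
    } RelocBlock;

    /* Walk a ".reloc"-style section: blocks of 16-bit entries, with the
     * reloc type in bits 15:12 and the in-page offset in bits 11:0. */
    static void walk_relocs(const uint8_t *relocs, size_t size,
                            void (*apply)(int type, uint32_t rva))
    {
        size_t pos = 0;
        while (pos + sizeof(RelocBlock) <= size) {
            const RelocBlock *blk = (const RelocBlock *)(relocs + pos);
            const uint16_t *ent = (const uint16_t *)(blk + 1);
            size_t n = (blk->szRelocs - sizeof(RelocBlock)) / 2;
            for (size_t i = 0; i < n; i++) {
                int type = ent[i] >> 12;           /* bits 15:12 */
                uint32_t off = ent[i] & 0x0FFF;    /* bits 11:0  */
                if (type != 0)                     /* 0 = padding/NOP */
                    apply(type, blk->rvaPage + off);
            }
            pos += blk->szRelocs;
        }
    }
    ```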

    One downside is this format is less efficient for sparse relocs
    (current situation), where often there are only 1 or 2 relocs per page
    (typically the PBO index fixups and similar).


    One option would be a modified format that partially omits the block
    structuring, say:
       0ddd: Advance current page position by ddd pages (4K);
         0000: Effectively a NOP (as before)
       1ddd..Cddd: Apply the given reloc.
         These represent typical relocs, target dependent.
         HI16, LO16, DIR32, HI32ADJ, ...
         8ddd: Was assigned for PBO fixups;
         Addd: Fixup for a 64-bit address, also semi common.
       Dzzz/Ezzz: Extended Relocs
         These ones are configurable from a larger set of reloc types.
       Fzzz: Command-Escape
       ...

    Where, say, rather than needing 1 block per 4K page, it is 1 block per
    PE section.
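    A minimal decoder sketch for this tweaked format (hypothetical; only
    the page-advance and ordinary-reloc cases are shown, with the
    extended/escape cases left out):

    ```c
    #include <stdint.h>

    /* Decode the sparse format: a running page position is advanced by
     * 0ddd entries, so there is no per-4K-page block header. */
    static void decode_sparse(const uint16_t *ent, int n, uint32_t base_rva,
                              void (*apply)(int type, uint32_t rva))
    {
        uint32_t page = base_rva;
        for (int i = 0; i < n; i++) {
            int type = ent[i] >> 12;
            uint32_t d = ent[i] & 0x0FFF;
            if (type == 0)
                page += d << 12;        /* 0ddd: advance ddd pages (4K) */
            else if (type <= 0xC)
                apply(type, page + d);  /* 1ddd..Cddd: apply given reloc */
            /* Dzzz/Ezzz: extended relocs, Fzzz: command escape (omitted) */
        }
    }
    ```

    With this shape, a run of pages containing no relocs costs one 16-bit
    entry instead of one 8-byte block header each, which is where the
    savings for sparse relocs come from.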


    Tested the above tweak; it can reduce the size of the ".reloc" section
    by around 20%, but would break compatibility with previous versions of
    my PEL loader.



    Though, base relocs are a relatively small part of the size of the
    binary.


    To some extent, the PBO reloc is magic in that it works by
    pattern-matching the instruction that it finds at the given address.
    So, in effect, it is only defined for a limited range of instructions.

    Contrast with, say, the 1/2/3/4/A relocs, which expect raw 16/32/64
    bit values. Though, a lot of these are not currently used for BJX2
    (does not use 16-bit addressing modes, ...).

    Here:
    5/6/7/8/9/B/C, ended up used for BJX2 relocs in BJX2 mode.
       For other targets, they would have other meanings.
    D/E/F were reserved as expanded/escape-case relocs, in case I need to
    add more. These would differ partly in that the reloc sub-type would
    be assigned as a sort of state-machine.


    but not the program itself}}

    As noted, the base relocs are applied by the PE / PEL loader.

    But, annoyingly, this would not map over so well to an ELF loader...


    Note that despite PEL keeping the same high level structure as PE/COFF,
    in the case of PBO, effectively the binary is split in half.

    So, the read-only sections (".text" and friends), and the read/write
    sections (".data" and ".bss") are entirely disjoint in memory.

    Likewise, read-only sections may not point to read/write sections, and
    the relocs are effectively applied in two different stages (for the
    read-only sections when the binary is loaded into memory; and to the
    read/write sections when a new instance is created).




    Meanwhile, got the My66000 LLVM/Clang compiler built to the point that
    it at least seems to try to build something (and seems to know that the
    target exists).


    But, it also tends to die in a storm of error messages, eg:

    /tmp/m_swap-822054.s:6: Error: no such instruction: `bitr r1,r1,<8:48>'
    /tmp/m_swap-822054.s:14: Error: no such instruction: `srl r2,r1,<0:24>'
    /tmp/m_swap-822054.s:15: Error: no such instruction: `srl r3,r1,<0:8>'
    /tmp/m_swap-822054.s:16: Error: too many memory references for `and'
    /tmp/m_swap-822054.s:17: Error: no such instruction: `sll r4,r1,<0:8>'
    /tmp/m_swap-822054.s:18: Error: too many memory references for `and'
    /tmp/m_swap-822054.s:19: Error: no such instruction: `sll r1,r1,<0:24>'
    /tmp/m_swap-822054.s:20: Error: too many memory references for `or'
    /tmp/m_swap-822054.s:21: Error: too many memory references for `or'
    /tmp/m_swap-822054.s:22: Error: too many memory references for `or'
    /tmp/m_cheat-f6c778.s: Assembler messages:
    /tmp/m_cheat-f6c778.s:6: Error: no such instruction: `ldub r3,[ip,firsttime]'
    /tmp/m_cheat-f6c778.s:7: Error: no such instruction: `bb1 0,r3,.LBB0_3'
    /tmp/m_cheat-f6c778.s:8: Error: no such instruction: `stb '
    /tmp/m_cheat-f6c778.s:9: Error: expecting operand after ','; got nothing
    /tmp/m_cheat-f6c778.s:10: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:12: Error: no such instruction: `bitr r5,r4,<1:56>'
    /tmp/m_cheat-f6c778.s:13: Error: no such instruction: `sll r6,r3,<0:1>'
    /tmp/m_cheat-f6c778.s:14: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:15: Error: no such instruction: `srl r7,r3,<0:1>'
    /tmp/m_cheat-f6c778.s:16: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:17: Error: no such instruction: `srl r8,r3,<0:5>'
    /tmp/m_cheat-f6c778.s:18: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:19: Error: no such instruction: `srl r9,r3,<1:7>'
    /tmp/m_cheat-f6c778.s:20: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:21: Error: too many memory references for `or'
    /tmp/m_cheat-f6c778.s:22: Error: too many memory references for `or'
    /tmp/m_cheat-f6c778.s:23: Error: too many memory references for `or'
    /tmp/m_cheat-f6c778.s:24: Error: too many memory references for `or'
    /tmp/m_cheat-f6c778.s:25: Warning: `r5' is not valid here (expected `(%rdi)')
    /tmp/m_cheat-f6c778.s:25: Warning: `r5' is not valid here (expected `(%rdi)')
    /tmp/m_cheat-f6c778.s:25: Error: too many memory references for `ins'
    /tmp/m_cheat-f6c778.s:26: Error: no such instruction: `stb r5,[ip,r3,cheat_xlate_table]'
    /tmp/m_cheat-f6c778.s:27: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:28: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:29: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:30: Error: no such instruction: `bne r5,.LBB0_2'
    /tmp/m_cheat-f6c778.s:32: Error: no such instruction: `ldd r4,[r1]'
    /tmp/m_cheat-f6c778.s:33: Error: expecting operand after ','; got nothing
    /tmp/m_cheat-f6c778.s:34: Error: no such instruction: `beq0 r4,.LBB0_17'
    /tmp/m_cheat-f6c778.s:35: Error: no such instruction: `ldd r5,[r1,8]'
    /tmp/m_cheat-f6c778.s:36: Error: no such instruction: `beq0 r5,.LBB0_5'
    /tmp/m_cheat-f6c778.s:37: Error: no such instruction: `ldub r6,[r5]'
    /tmp/m_cheat-f6c778.s:38: Error: no such instruction: `beq0 r6,.LBB0_7'
    /tmp/m_cheat-f6c778.s:40: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:41: Error: no such instruction: `ldub r2,[ip,r2,cheat_xlate_table]'
    /tmp/m_cheat-f6c778.s:42: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:43: Error: no such instruction: `bne r2,.LBB0_10'
    /tmp/m_cheat-f6c778.s:44: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:45: Error: too many memory references for `std'
    /tmp/m_cheat-f6c778.s:46: Error: no such instruction: `ldub r2,[r4]'
    /tmp/m_cheat-f6c778.s:47: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:48: Error: no such instruction: `bne r5,.LBB0_12'
    /tmp/m_cheat-f6c778.s:49: Error: no such instruction: `br .LBB0_14'
    /tmp/m_cheat-f6c778.s:52: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:55: Error: too many memory references for `std'
    /tmp/m_cheat-f6c778.s:56: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:57: Error: no such instruction: `ldub r6,[r5]'
    /tmp/m_cheat-f6c778.s:58: Error: no such instruction: `bne0 r6,.LBB0_8'
    /tmp/m_cheat-f6c778.s:60: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:61: Error: too many memory references for `std'
    /tmp/m_cheat-f6c778.s:62: Error: no such instruction: `stb r2,[r5]'
    /tmp/m_cheat-f6c778.s:63: Error: no such instruction: `ldd r4,[r1,8]'
    /tmp/m_cheat-f6c778.s:64: Error: no such instruction: `ldub r2,[r4]'
    /tmp/m_cheat-f6c778.s:65: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:66: Error: no such instruction: `bne r5,.LBB0_12'
    /tmp/m_cheat-f6c778.s:68: Error: no such instruction: `ldd r2,[r1]'
    /tmp/m_cheat-f6c778.s:69: Error: expecting operand after ','; got nothing
    /tmp/m_cheat-f6c778.s:70: Error: too many memory references for `std'
    /tmp/m_cheat-f6c778.s:71: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:74: Error: too many memory references for `std'
    /tmp/m_cheat-f6c778.s:75: Error: no such instruction: `ldub r2,[r4]'
    /tmp/m_cheat-f6c778.s:76: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:77: Error: no such instruction: `beq r5,.LBB0_14'
    /tmp/m_cheat-f6c778.s:79: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:80: Error: no such instruction: `bne r2,.LBB0_16'
    /tmp/m_cheat-f6c778.s:81: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:82: Error: expecting operand after ','; got nothing
    /tmp/m_cheat-f6c778.s:83: Error: too many memory references for `std'
    /tmp/m_cheat-f6c778.s:84: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:92: Error: no such instruction: `ldd r3,[r1]'
    /tmp/m_cheat-f6c778.s:94: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:95: Error: no such instruction: `ldub r3,[r3]'
    /tmp/m_cheat-f6c778.s:96: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:97: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:98: Error: no such instruction: `bne r4,.LBB1_1'
    /tmp/m_cheat-f6c778.s:99: Error: no such instruction: `ldub r4,[r1]'
    /tmp/m_cheat-f6c778.s:100: Error: expecting operand after ','; got nothing
    /tmp/m_cheat-f6c778.s:102: Error: too many memory references for `mov'
    /tmp/m_cheat-f6c778.s:103: Error: no such instruction: `stb r4,[r2,r3,-1]'
    /tmp/m_cheat-f6c778.s:104: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:105: Error: no such instruction: `stb '
    /tmp/m_cheat-f6c778.s:106: Error: no such instruction: `ldub r4,[r1,r3,0]'
    /tmp/m_cheat-f6c778.s:107: Error: too many memory references for `and'
    /tmp/m_cheat-f6c778.s:108: Error: no such instruction: `beq0 r6,.LBB1_5'
    /tmp/m_cheat-f6c778.s:109: Error: too many memory references for `add'
    /tmp/m_cheat-f6c778.s:110: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:111: Error: no such instruction: `bne r5,.LBB1_3'
    /tmp/m_cheat-f6c778.s:112: Error: no such instruction: `br .LBB1_6'
    /tmp/m_cheat-f6c778.s:114: Error: too many memory references for `cmp'
    /tmp/m_cheat-f6c778.s:115: Error: no such instruction: `beq r1,.LBB1_6'
    /tmp/m_cheat-f6c778.s:118: Error: no such instruction: `stb '
    /tmp/m_random-1b60b6.s: Assembler messages:
    /tmp/m_random-1b60b6.s:6: Error: no such instruction: `lduw r1,[ip,prndindex]'
    /tmp/m_random-1b60b6.s:7: Error: too many memory references for `add'
    /tmp/m_random-1b60b6.s:8: Error: too many memory references for `and'
    /tmp/m_random-1b60b6.s:9: Error: no such instruction: `stw r1,[ip,prndindex]'
    /tmp/m_random-1b60b6.s:10: Error: no such instruction: `ldub r1,[ip,r1,rndtable]'
    /tmp/m_random-1b60b6.s:18: Error: no such instruction: `lduw r1,[ip,rndindex]'
    /tmp/m_random-1b60b6.s:19: Error: too many memory references for `add'
    /tmp/m_random-1b60b6.s:20: Error: too many memory references for `and'
    /tmp/m_random-1b60b6.s:21: Error: no such instruction: `stw r1,[ip,rndindex]'
    /tmp/m_random-1b60b6.s:22: Error: no such instruction: `ldub r1,[ip,r1,rndtable]'
    /tmp/m_random-1b60b6.s:30: Error: no such instruction: `stw '
    /tmp/m_random-1b60b6.s:31: Error: no such instruction: `stw '

    ...

    /tmp/sounds-95ce64.s: Assembler messages:
    /tmp/sounds-95ce64.s:346: Error: unknown pseudo-op: `.dword'
    /tmp/sounds-95ce64.s:349: Error: unknown pseudo-op: `.dword'
    /tmp/sounds-95ce64.s:352: Error: unknown pseudo-op: `.dword'
    /tmp/sounds-95ce64.s:355: Error: unknown pseudo-op: `.dword'
    /tmp/sounds-95ce64.s:358: Error: unknown pseudo-op: `.dword'
    /tmp/sounds-95ce64.s:361: Error: unknown pseudo-op: `.dword'
    /tmp/sounds-95ce64.s:364: Error: unknown pseudo-op: `.dword'

    ...

    and so on...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 29 16:53:29 2024
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:


    Meanwhile, got the My66000 LLVM/Clang compiler built to the point that
    it at least seems to try to build something (and seems to know that the
    target exists).


    But, it also tends to die in a storm of error messages, eg:

    /tmp/m_swap-822054.s:6: Error: no such instruction: `bitr r1,r1,<8:48>'

    You can only generate assembly code, so just use "-S".

    If you want to assemble to object files, you can use my binutils
    branch on github. I have not yet started on the linker (there
    are still quite a few decisions to be made regarding relocations,
    which is a topic that I do not enjoy too much).
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Apr 30 15:45:21 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    Lawrence D'Oliveiro wrote:

    On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

    Since there was only one set of arithmetic instructions, that meant that
    when you wrote code to operate on unsigned values, you had to remember
    that the normal names of the condition code values were oriented around
    signed arithmetic.

    I thought architectures typically had separate condition codes for
    “carry” versus “overflow”. That way, you didn’t need signed versus
    unsigned versions of add, subtract and compare; it was just a matter of
    looking at the right condition codes on the result.

    Maybe now, with 4-or-5-bit condition codes, yes;
    but the early machines (360) with 2-bit codes were already constricted.

    The B3500 (contemporaneous with 360) had COMS toggles (2 bits) and OVERFLOW toggle (1 bit).
    --- Synchronet 3.20a-Linux NewsLink 1.114