• Re: Pseudo-Immediates as Part of the Instruction

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 24 18:16:12 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Sun, 03 Aug 2025 13:03:21 -0700, Stephen Fuld wrote:

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real programs*. If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture. The best way to learn about what features are
    useful is to try to use them, and the best way to do that is to write actual code for a real program.

    While I'm not prepared to go to the trouble of creating a fleshed-out example, a very short and trivial example will still indicate what my
    goals are.

    X = Y * 2.78 + Z

    Just playing devil's advocate:: My 66000

    LDD R8,[Y]
    LDD R6,[Z]
    FMAC R7,R8,#2.78D0,R6
    STD R7,[X]

    X, Y, and Z can be anywhere in 64-bit VAS ...
    On the other hand if X, Y, and Z were allocated into registers::

    FMAC Rx,Ry,#2.78D0,Rz

    On a typical RISC architecture, this would involve instructions like this:

    load 18, Y
    load 19, K#0001
    fmul 18, 18, 19
    load 19, Z
    fadd 18, 18, 19
    fsto 18, X

    Six instructions, each 32 bits long.

    On the IBM System/360, though, it would be something like

    le 12, Y
    me 12, K#0001
    ae 12, Z
    ste 12, X

    All four instructions are memory-reference instructions, so they're also
    32 bits long.

    How would I do this on Concertina II?

    Well, since the sequence has to start with a memory-reference, I can't use the zero-overhead header (Type I). Instead, a Type XI header is in order; that specifies a decode field, so that space can be reserved for a pseudo-immediate, and instruction slots can be indicated as containing
    instructions from the alternate instruction set.

    Then the instructions can be

    lf 6,y
    mfr 6,#2.78
    af 6,z
    stf 6,x

    with the instruction "af" coming from the alternate 32-bit instruction set.

    The other tricky precondition that must be met is to store z in a data region that is only 4,096 bytes or less in size, prefaced with

    USING *,23

    or another register from 17 to 23 could be used as the base register, so that it is addressed with a 12-bit displacement. (Also, register 6, from
    the first eight registers, is used to do the arithmetic to meet the limitations of the "add floating" memory to register operate instruction
    in the alternate instruction set.)

    Because it uses a pseudo-immediate, which gets fetched along with the instruction stream, where the 360 uses a constant, it has an advantage
    over the 360. On the other hand, while the actual code is the same length, there's also the 32-bit overhead of the header.
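
    A rough tally of the sizes being compared can be sketched as below. The instruction and header counts come from the text above; the 32-bit size assumed for the constant 2.78 (in a literal pool or as a pseudo-immediate) and the names `WORD`, `total_bits`, and `sizes` are assumptions made for illustration.

```python
# Code-size tally for X = Y * 2.78 + Z, using the counts stated above.
# Assumption (not from the post): the constant occupies 32 bits in every
# case, whether it sits in a literal pool or in the instruction stream.
WORD = 32  # bits per instruction word or header word


def total_bits(instructions, header_words=0, constant_bits=32):
    """Bits of code plus bits of constant storage for one sequence."""
    return (instructions + header_words) * WORD + constant_bits


sizes = {
    "typical RISC": total_bits(6),
    "System/360": total_bits(4),
    "Concertina II": total_bits(4, header_words=1),
}

for name, bits in sizes.items():
    print(f"{name}: {bits} bits")
```

    Under that assumption the Concertina II sequence comes out exactly one 32-bit header larger than the System/360 one.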

    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 24 19:50:44 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 8/5/2025 11:51 AM, Stephen Fuld wrote:
    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK.  I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial, I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.  After all, both of those instructions can be accomplished by two "standard" instructions, a store and an add (for push) and a load and subtract (for pop).  Interchange
    the add and the subtract if you want the stack to grow in the other direction.

    Of course, you are free to stop contributing on this topic, but I, for one, will miss your contributions.



    The lack of dedicated PUSH/POP instructions IME has relatively little
    direct impact on the usability of an ISA. Either way, one is likely to
    need stack-frame adjustment, in which case PUSH/POP don't tend to offer
    much over normal Load/Store instructions.

    When I looked at this at AMD circa 2000, I found many Pushes/Pops occurred
    in short sequences of 2-4; like:

    Push EAX
    Push EBP
    Push ECX

    a) we should note pushes are serially dependent on the decrement of SP
    b) and so are the memory references

    But we could change these into::

    ST EAX,[SP-8]
    ST EBP,[SP-16]
    ST ECX,[SP-24]
    SUB SP,SP,24

    a) now all the memory references are parallel
    b) there is only one alteration of SP
    c) all 4 instructions can start simultaneously
    So, latency goes from 3 to 1.
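
    The "3 to 1" figure can be reproduced with a minimal dependency model. This is only a sketch: it assumes every operation takes one cycle and that issue width is unlimited, and the function name `critical_path` is made up for the example.

```python
# Minimal dependency-latency model (a sketch, not a pipeline simulator).
# Each op lists the ops whose results it must wait for. Assumptions:
# every op takes one cycle and any number of ops can issue per cycle.

def critical_path(ops):
    """Return the length of the longest dependency chain, in cycles."""
    finish = {}
    for name, deps in ops:  # ops are listed in program order
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + 1
    return max(finish.values())


# Three pushes: each needs the SP value produced by the previous push.
pushes = [
    ("Push EAX", []),
    ("Push EBP", ["Push EAX"]),
    ("Push ECX", ["Push EBP"]),
]

# Rewritten form: every op reads only the original SP.
rewritten = [
    ("ST EAX,[SP-8]", []),
    ("ST EBP,[SP-16]", []),
    ("ST ECX,[SP-24]", []),
    ("SUB SP,SP,24", []),
]

print(critical_path(pushes), critical_path(rewritten))  # 3 1
```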

    That said, a lot of John's other ideas come off to me like straight up absurdity. So, I wouldn't hold up much hope personally for it to turn
    into much usable.


  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 24 16:21:06 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    BGB <cr88192@gmail.com> posted:
    The lack of dedicated PUSH/POP instructions IME has relatively little
    direct impact on the usability of an ISA. Either way, one is likely to
    need stack-frame adjustment, in which case PUSH/POP don't tend to offer
    much over normal Load/Store instructions.

    When I looked at this at AMD circa 2000, I found many Pushes/Pops occurred
    in short sequences of 2-4; like:

    Push EAX
    Push EBP
    Push ECX

    a) we should note pushes are serially dependent on the decrement of SP
    b) and so are the memory references

    But we could change these into::

    ST EAX,[SP-8]
    ST EBP,[SP-16]
    ST ECX,[SP-24]
    SUB SP,SP,24

    a) now all the memory references are parallel
    b) there is only one alteration of SP
    c) all 4 instructions can start simultaneously
    So, latency goes from 3 to 1.

    Except storing below the SP is not interrupt safe without
    something special like defining a safe "red zone" below it.
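
    A toy model of the hazard, as a sketch only: memory is a dictionary, the stack grows down, and an interrupt writes its frame just below whatever it treats as the top of stack. The frame size, the addresses, and the names `interrupt` and `run` are illustrative assumptions; the 128-byte figure is borrowed from the x86-64 System V ABI's red zone.

```python
# Toy model: memory is a dict of address -> value and the stack grows down.
# An "interrupt" writes a frame of saved state just below the address it
# treats as the top of stack. All sizes here are illustrative assumptions.

def interrupt(mem, sp, frame_words=2, red_zone=0):
    """Write an interrupt frame below sp, skipping red_zone bytes first."""
    top = sp - red_zone
    for i in range(1, frame_words + 1):
        mem[top - 8 * i] = "IRQ"


def run(red_zone):
    mem, sp = {}, 1000
    mem[sp - 8] = "EAX"                     # ST EAX,[SP-8]
    mem[sp - 16] = "EBP"                    # ST EBP,[SP-16]
    interrupt(mem, sp, red_zone=red_zone)   # arrives before SUB SP,SP,24
    mem[sp - 24] = "ECX"                    # ST ECX,[SP-24]
    sp -= 24                                # SUB SP,SP,24
    return [mem[sp + 16], mem[sp + 8], mem[sp]]


print(run(red_zone=0))    # ['IRQ', 'IRQ', 'ECX']
print(run(red_zone=128))  # ['EAX', 'EBP', 'ECX']
```

    With no red zone the interrupt frame lands on the two values already stored; with one, the frame is placed below the region the stores use.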


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 29 15:31:32 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-08-01 5:04 p.m., John Savard wrote:
    On Fri, 01 Aug 2025 18:08:17 +0000, Thomas Koenig wrote:

    Question: Do the pointers point to the same block only, or also to other blocks? With 5 bits, you could address others as well. Can you give an
    example of their use, including the block headers?

    Actually, no, 5 bits are only enough to point within the same block.
    That's because it's a byte pointer, as it can be used to point to any type of constant, including single byte constants.
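
    The arithmetic behind that answer, as a short sketch. The 256-bit block size is the one given elsewhere in this thread; the helper name `in_block` is made up for the example.

```python
BLOCK_BITS = 256       # instruction block size stated elsewhere in the thread
POINTER_BITS = 5       # width of the pseudo-immediate pointer

block_bytes = BLOCK_BITS // 8      # bytes in one block
reachable = 2 ** POINTER_BITS      # distinct byte offsets the pointer encodes

# A 5-bit byte pointer covers exactly one block and nothing beyond it.
print(block_bytes, reachable)      # 32 32


def in_block(offset):
    """True if a byte offset fits in the 5-bit pointer."""
    return 0 <= offset < reachable
```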

    This is despite the fact that I do have an instruction format for conventional style byte immediates (and I've just squeezed in one for 16-bit immediates as well).

    However, they _can_ point to another block, by means of a sixth bit that some instructions have... but when this happens, it does not trigger an extra fetch from memory. Instead, the data is retrieved from a copy of an earlier block in the instruction stream that's saved in a special register... so as to reduce potential NOP-style problems.

    John Savard

    I tried something similar to this but without block headers and it
    worked okay. But there were a couple of issues. One was the last
    instruction in cache line could not have an immediate. Or instructions
    had to stop before the end of the cache line to accommodate immediates.
    This resulted in some wasted space. There would sometimes be a 32-bit
    hole between the last instruction and the first immediate. I used a
    four-bit index, with 32 bits as both the immediate and the instruction
    word size. Four bits were enough for a 512-bit cache line (sixteen
    32-bit slots). IIRC the wasted space was about 5%.

    We really don't want to waste space.

    It made the assembler more complex. I had immediates being positioned
    from the far end of the cache line down (like a stack) towards the instructions which began at the lower end. The assembler had to be able
    to keep track of where things were on the cache line and the assembler
    was not built to handle that.
    Also, it made reading listings more difficult as constants were in the middle of sequences of instructions.
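
    The layout described above can be sketched as follows: one 512-bit line treated as sixteen 32-bit slots, instructions filling from slot 0 upward and immediates from slot 15 downward. The function name `pack_line` and the data shapes are hypothetical; this is not the actual assembler.

```python
# Sketch of packing one 512-bit cache line as sixteen 32-bit slots.
# Instructions fill from slot 0 upward, immediates from slot 15 downward.
# Names and data shapes are hypothetical, chosen only for illustration.
SLOTS = 16


def pack_line(instrs):
    """instrs: list of (mnemonic, immediate or None).

    Returns (line, leftover): line is a list of SLOTS entries and
    leftover holds the instructions that did not fit on this line.
    """
    line = [None] * SLOTS
    lo, hi = 0, SLOTS - 1              # next free instruction / immediate slot
    for n, (op, imm) in enumerate(instrs):
        need = 1 if imm is None else 2
        if hi - lo + 1 < need:         # not enough free slots left
            return line, instrs[n:]
        if imm is None:
            line[lo] = (op, None)
        else:
            line[hi] = imm
            line[lo] = (op, hi)        # 4-bit index names the immediate's slot
            hi -= 1
        lo += 1
    return line, []


program = [("add", 1)] * 7 + [("mul", None), ("sub", 2)]
line, leftover = pack_line(program)
print(line.count(None), leftover)      # 1 [('sub', 2)]
```

    In this run the final instruction needs two slots but only one is free, so it moves to the next line and leaves a single 32-bit hole between the last instruction and the first immediate.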

    We really don't want to make it any harder to read ASM code.

    Sometimes constants could be shared, but this turned out to be not
    possible in many cases as the assembler needed to emit relocation
    records for some constants and it could not handle having two or more instructions pointing to the same constant.

    All the more reason to place the constant in the instruction stream.
    a) never wastes space*
    b) ASM readability

    (*) never wastes space refers to placement of constant, not that the constant-container is optimal for the placed constant.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 29 19:35:15 2025
    From Newsgroup: comp.arch


    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Fri, 1 Aug 2025 15:11:49 -0000 (UTC), John Savard wrote:

    Well, that pointer - five bits long - is an awfully short pointer. Where does it point?

    Instructions are fetched in blocks that are 256 bits long. One of the things this allows for is for the block to begin with a header that specifies that a certain number of 32-bit instruction slots at the end
    of the current block are to be skipped over in the sequence of
    instructions to be executed; this space can be used for constants.

    Just add a couple of modifier bits: one is the indirect bit, indicating
    that the location referenced contains the address of the value, not the value itself, and another “page zero” bit, which indicates that the location is not in the current block, but in another block at a fixed address ...

    What is the purported advantage of using a header instead of just having
    each instruction define its own length and contain its own constants?

    ... and I start having PDP-8 flashbacks.

    As well you should.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Sep 3 18:26:18 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial,

    AMD's (and my) K9 translated push and pop into ST and LD followed
    by sub/add, and then peephole combined the several adds so that a
    sequence of instructions::

    Push RAX
    Push RCX
    Push RDX

    became a parallel list of Operations::

    ST RAX,[SP-8]
    ST RCX,[SP-16]
    ST RDX,[SP-24]
    SUB SP,SP,#24

    Taking a data-dependent series of instructions (minimum of 3 cycles)
    and allowing all of them to begin execution in the same cycle. This is
    the fallacy of {push, pop, (Rx)++, --(Rx), and similar}. With GBOoO
    it is data-dependent latency that matters, not instruction count.
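
    The rewrite itself can be sketched in a few lines, along with a check that it leaves memory and SP in the same state as the original pushes. This is an illustration under stated assumptions, not the K9 decoder: the 8-byte slot size matches the example above, and the function names are made up.

```python
# Sketch of the push-combining rewrite plus an equivalence check.
# Assumptions: 8-byte stack slots, a downward-growing stack, and a run
# of pushes with nothing else in between. Function names are hypothetical.

def combine_pushes(regs, slot=8):
    """Rewrite a run of pushes as stores at fixed offsets plus one SUB."""
    ops = [f"ST {r},[SP-{slot * (i + 1)}]" for i, r in enumerate(regs)]
    ops.append(f"SUB SP,SP,#{slot * len(regs)}")
    return ops


def push_semantics(regs, values, sp, slot=8):
    """Serial form: each push decrements SP, then stores at the new SP."""
    mem = {}
    for r in regs:
        sp -= slot
        mem[sp] = values[r]
    return mem, sp


def combined_semantics(regs, values, sp, slot=8):
    """Parallel form: all stores use the original SP, then one subtract."""
    mem = {sp - slot * (i + 1): values[r] for i, r in enumerate(regs)}
    return mem, sp - slot * len(regs)


regs = ["RAX", "RCX", "RDX"]
values = {"RAX": 1, "RCX": 2, "RDX": 3}
print(combine_pushes(regs))
assert push_semantics(regs, values, 1000) == combined_semantics(regs, values, 1000)
```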

    I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.

    Push and Pop only scratch the surface.

    After all, both of those
    instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop). Interchange
    the add and the subtract if you want the stack to grow in the other direction.

    Which we quit doing 30-odd years ago.

    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.
  • From BGB@cr88192@gmail.com to comp.arch on Wed Sep 3 14:55:39 2025
    From Newsgroup: comp.arch

    On 9/3/2025 1:26 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be
    beneficial,

    AMD's (and my) K9 translated push and pop into ST and LD followed
    by sub/add, and then peephole combined the several adds so that a
    sequence of instructions::

    Push RAX
    Push RCX
    Push RDX

    became a parallel list of Operations::

    ST RAX,[SP-8]
    ST RCX,[SP-16]
    ST RDX,[SP-24]
    SUB SP,SP,#24

    Taking a data-dependent series of instructions (minimum of 3 cycles)
    and allowing all of them to begin execution in the same cycle. This is
    the fallacy of {push, pop, (Rx)++, --(Rx), and similar}. With GBOoO
    it is data-dependent latency that matters, not instruction count.


    Though, this can be more an argument that having PUSH and POP is not worthwhile. Maybe they help slightly with code density, but this is
    about it.


    They were something I ended up dropping earlier on, as it started to
    become obvious at the time that having them was net negative.


    It is possible that the assembler can fake them as pseudo-instructions,
    but even this doesn't seem worthwhile.

    Well, never mind that I did eventually go and add the logic in the
    assembler to fake auto-increment addressing modes in RISC-V and similar.

    So, say:
    MOV.L @R10+, R13
    MOV.L R13, @-R11
    Or:
    MOV.L (R10)+, R13
    MOV.L R13, -(R11)

    Will at least work (by cracking each into multiple instructions), but, ...
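
    What "cracking" means here can be sketched as below. This is a hypothetical illustration, not BGBCC's implementation: the output mnemonics (LW/SW/ADDI) are RISC-V-like, the register names are left as written in the input, and a 4-byte access size is assumed for MOV.L.

```python
import re

# Hypothetical cracker for the two syntaxes shown above (source operand
# first, destination second). Output mnemonics are RISC-V-like, but the
# exact text is illustrative only; a 4-byte access is assumed for MOV.L.

def crack(line, size=4):
    """Split one auto-increment/decrement move into two plain instructions."""
    _, args = line.split(None, 1)
    src, dst = [a.strip() for a in args.split(",")]
    post = re.fullmatch(r"@(\w+)\+|\((\w+)\)\+", src)   # @Rn+ or (Rn)+
    pre = re.fullmatch(r"@-(\w+)|-\((\w+)\)", dst)      # @-Rn or -(Rn)
    if post:                       # load, then bump the pointer
        base = post.group(1) or post.group(2)
        return [f"LW {dst}, 0({base})", f"ADDI {base}, {base}, {size}"]
    if pre:                        # drop the pointer, then store
        base = pre.group(1) or pre.group(2)
        return [f"ADDI {base}, {base}, -{size}", f"SW {src}, 0({base})"]
    return [line]                  # nothing to crack


for text in ("MOV.L @R10+, R13", "MOV.L R13, -(R11)"):
    print(text, "->", crack(text))
```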


    Still part of the ongoing tension of BGBCC targeting RV while using AT&T
    style ASM syntax (and, for ASM fragments, trying to infer the operand
    ordering per fragment based on which mnemonics are used, which isn't
    helped by my specs sort of mixing the use of mnemonics). If there isn't
    enough to base a choice on, it defaults to the AT&T style
    operand ordering.

    Possible foot-guns all around here, but lack a better solution ATM.


    I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.

    Push and Pop only scratch the surface.

    After all, both of those
    instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop). Interchange
    the add and the subtract if you want the stack to grow in the other
    direction.

    Which we quit doing 30-odd years ago.


    I think the usual argument for a grows-upward stack is that
    (presumably) it makes it less likely that a buffer overflow will hit the saved-registers area.

    But, pretty much everyone settled on grows-down stack.

    On a RISC-style ISA, main difference it makes is that the OS and ABI
    need to agree on which direction the stack goes. Though, in theory,
    assuming it were an ABI choice, a flag in the binary or similar could be
    used to signal stack direction. Granted, DLLs/SOs would also need to
    agree which way the stack goes.



    Actually, could almost make a case for big-endian support, with binaries setting a flag for big-endian, and some sort of CPU control flag to set operation into big-endian mode.

    But, say, having a mismatch between OS and application endianness is
    asking for a mess.

    Less awful being that pretty much everything defaults to little-endian,
    but then having ISA support for endian-swapping, and some way to flag
    data as big-endian.

    FWIW, BGBCC has a __bigendian modifier, but this is pretty nonstandard
    (and not well tested).
    IIRC, ATM, would need to be applied to every member in a struct for a
    fully BE struct, but could maybe make sense to allow it to apply to a
    whole struct (in a similar way to "__packed" or "__attribute__((packed))").

    Mostly only applies to struct members and pointers, and only for integer types. Would also suck on RISC-V, which lacks any good way to do endian swapping.


    But, at least in the case of big-endian, it is commonly used in network protocols and some file formats, so not completely useless.
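
    For a concrete picture of the byte orders involved, here is a small sketch using Python's standard `struct` module. The helper `bswap32` is written for this example and shows the shift-and-mask sequence needed when no dedicated swap instruction is available.

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)      # big-endian (network) byte order
little = struct.pack("<I", value)   # little-endian byte order

print(big.hex(), little.hex())      # 12345678 78563412


def bswap32(x):
    """Byte-swap a 32-bit value using only shifts, masks, and ORs."""
    return (((x & 0x000000FF) << 24) | ((x & 0x0000FF00) << 8) |
            ((x & 0x00FF0000) >> 8) | ((x & 0xFF000000) >> 24))


# Reading big-endian bytes as if they were little-endian is a byte swap.
assert bswap32(value) == int.from_bytes(big, "little")
```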


    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.

