• Split instruction and immediate stream

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Mar 8 14:21:51 2025
    From Newsgroup: comp.arch

    There was a recent post to the gcc mailing list which showed an
    interesting concept for dealing with large constants in an ISA:
    splitting the instruction and constant streams. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Mar 8 17:53:34 2025
    From Newsgroup: comp.arch

    On Sat, 8 Mar 2025 14:21:51 +0000, Thomas Koenig wrote:

    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I knew a guy with that name at AMD--he did microcode--and did it well.

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I took a quick look, and it seems that
    a) too few registers
    b) too many OpCode bits
    although it does look easy to parse.
  • From BGB@cr88192@gmail.com to comp.arch on Sat Mar 8 14:15:16 2025
    From Newsgroup: comp.arch

    On 3/8/2025 11:53 AM, MitchAlsup1 wrote:
    On Sat, 8 Mar 2025 14:21:51 +0000, Thomas Koenig wrote:

    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream.  It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I knew a guy with that name at AMD--he did microcode--and did it well.


    This was also posted to the RISC-V mailing list...


    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I took a quick look, and it seems that
    a) too few registers
    b) too many OpCode bits
    although it does look easy to parse.


    Yeah, a bit of a rebalance is needed...

    The design goes to 12 bit register fields for 64-bit ops, which is just
    absurd, and doesn't really leave enough bits for immediate encodings in
    the instruction formats.




    If I were to do a vaguely similar design, probably:
    Bit 0 of each 16-bit word indicates a following word is present;
    16-bit ops have 2R with 16 registers;
    32-bit ops have 3R with 64 registers.


    Say, 16b:
    zzzz-mmmm-nnnn-zzz0 //2R
    zzzz-iiii-nnnn-zzz0 //2RI, Imm4
    zzzz-iiii-iiii-zzz0 //Imm8 (Branch, AddSP)

    Then, 32b:
    mmnn-mmmm-nnnn-zzz1 zzzz-tttt-ttzz-zzz0 //3R
    mmnn-mmmm-nnnn-zzz1 zzzz-iiii-iizz-zzz0 //3RI, Imm6
    mmnn-mmmm-nnnn-zzz1 iiii-iiii-iizz-zzz0 //3RI, Imm10
    iinn-iiii-nnnn-zzz1 iiii-iiii-iizz-zzz0 //2RI, Imm16
    iiii-iiii-iiii-zzz1 iiii-iiii-iizz-zzz0 //Imm22 (Branch)

    Could have 48 and 64 bit encodings, which keep the same base layout as
    the 32-bit ops, but maybe extend immediate and opcode bits.

    Say, 48-bit:
    mmnn-mmmm-nnnn-zzz1 iiii-iiii-iizz-zzz1
    iiii-iiii-iiii-iiz0 //3RI, Imm24

    And, 64-bit:
    mmnn-mmmm-nnnn-zzz1 iiii-iiii-iizz-zzz1
    iiii-iiii-iiii-iiz1 zzzz-iiii-iiii-izz0 //3RI, Imm33
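
    A minimal sketch of the length rule described above (bit 0 of each
    16-bit word set means another word follows). The function name and the
    example word values are made up for illustration; only the low bit of
    each word matters here.

```python
def instruction_length(halfwords, start=0, max_words=4):
    """Return the instruction length in 16-bit words, starting at `start`.

    Assumes the rule sketched above: bit 0 of a word set means another
    word follows. max_words=4 caps the scan at the 64-bit format.
    """
    n = 1
    while halfwords[start + n - 1] & 1:
        if n == max_words:
            raise ValueError("continuation bit set on the last allowed word")
        n += 1
    return n


# Hypothetical stream: a 16-bit op, a 32-bit op, then a 48-bit op.
stream = [0x1230, 0x4561, 0x7890, 0x0001, 0x0003, 0x0002]
lengths = []
pos = 0
while pos < len(stream):
    n = instruction_length(stream, pos)
    lengths.append(n * 16)
    pos += n
print(lengths)
```

    Note the serial dependence: the length of a 64-bit op is only known
    after looking at three words in turn, which is part of why mixed
    16/32/48/64 decode costs more than 32/64.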


    For register space, might make sense to map the 16-bit ops to R16..R31,
    but then organize the registers such that it has access to both
    callee-save and argument registers.

    Say:
    R0 ..R3 ZR, LR, SP, GP
    R4 ..R15 Callee Save (12)
    R16..R23 Callee Save ( 8)
    R24..R27 Scratch ( 4)
    R28..R31 Args 0..3 ( 4)
    R32..R43 Args 4..15 (12)
    R44..R51 Scratch ( 8)
    R52..R63 Callee Save (12)
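
    A quick tally of the register map above, counting straight from the
    listed ranges. This is only a sanity-check sketch; the role labels are
    shortened and the helper name is hypothetical.

```python
# Register ranges as listed above: (first, last, role).
REG_MAP = [
    (0, 3, "special"),        # ZR, LR, SP, GP
    (4, 15, "callee save"),
    (16, 23, "callee save"),
    (24, 27, "scratch"),
    (28, 31, "args"),
    (32, 43, "args"),
    (44, 51, "scratch"),
    (52, 63, "callee save"),
]


def roles_in_window(lo, hi):
    """Count registers per role inside the window lo..hi (inclusive)."""
    counts = {}
    for first, last, role in REG_MAP:
        overlap = min(last, hi) - max(first, lo) + 1
        if overlap > 0:
            counts[role] = counts.get(role, 0) + overlap
    return counts


total = sum(last - first + 1 for first, last, _ in REG_MAP)
window16 = roles_in_window(16, 31)  # what the 16-bit ops could reach
print(total, window16)
```

    The R16..R31 window ends up with a mix of callee-save, scratch, and
    argument registers, which is the stated point of mapping the 16-bit
    ops there.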


    16b opcode map, possible:
    00tt-mmmm-nnnn-0000 //Store (B/W/L/Q), "MOV.x Rn, (Rm)"
    0100-iiii-nnnn-0000 MOV.Q Rn, (SP, Imm4*8)
    0101-iiii-nnnn-0000 MOV.X Xn, (SP, Imm4*8) //Pair
    0110-iiii-nnnn-0000 MOV.Q (SP, Imm4*8), Rn
    0111-iiii-nnnn-0000 MOV.X (SP, Imm4*8), Xn //Pair
    1ttt-mmmm-nnnn-0000 //Load (SB/SW/SL/Q, UB/UW/UL/X)

    0000-mmmm-nnnn-0010 ADD Rm, Rn
    0001-mmmm-nnnn-0010 SUB Rm, Rn
    0010-mmmm-nnnn-0010 ADDSL Rm, Rn
    0011-mmmm-nnnn-0010 SUBSL Rm, Rn
    0100-mmmm-nnnn-0010 -
    0101-mmmm-nnnn-0010 AND Rm, Rn
    0110-mmmm-nnnn-0010 OR Rm, Rn
    0111-mmmm-nnnn-0010 XOR Rm, Rn
    ...

    0000-iiii-nnnn-0100 ADD Imm4u, Rn
    0001-iiii-nnnn-0100 SUB Imm4u, Rn
    0010-iiii-nnnn-0100 ADDSL Imm4u, Rn
    0011-iiii-nnnn-0100 SUBSL Imm4u, Rn
    0100-iiii-iiii-0100 ADD Imm8u*8, SP
    0101-iiii-iiii-0100 SUB Imm8u*8, SP
    0110-iiii-iiii-0100 BRA Imm8u (+512B)
    0111-iiii-iiii-0100 BRA Imm8n (-512B)

    ...

    00nn-iiii-nnnn-1010 ? MOV Imm4u, Yn
    01nn-iiii-nnnn-1010 ? ADD Imm4u, Yn
    10nn-iiii-nnnn-1010 ? MOV Imm4n, Yn
    11nn-iiii-nnnn-1010 ? ADD Imm4n, Yn

    mmnn-mmmm-nnnn-1100 ? MOV Ym, Yn //2R MOV
    mmnn-mmmm-nnnn-1110 ? ADD Ym, Yn //2R ADD

    There are only a few ops which have access to the full GPR space, as
    this is very expensive for 16-bit ops, so best limited to only the most
    common cases.

    ...


    The 32-bit opcode map, not laid out here, would likely be entirely disconnected from the 16-bit map.


    The usual tradeoff, though, is that 16/32/48/64 bit encodings would make
    superscalar more difficult and more expensive than 32/64.

    But such a layout could potentially be good for code density, at least.

    Best I could come up with on a quick-and-dirty pass, it seems...


    Don't have much time right now, so will leave it at this.

    ...

  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Mar 8 21:28:25 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I have not looked at the link, but I would be quite surprised if the
    idea isn't already covered by one or more Mill patents.

    Mill does indeed split the instruction stream in two, it is one of the
    enablers for supporting a lot more instructions/cycle.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Mar 8 21:43:40 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I have not looked at the link, but I would be quite surprised if the
    idea isn't already covered by one or more Mill patents.

    Mill does indeed split the instruction stream in two, it is one of the enablers for supporting a lot more instructions/cycle.

    I was a bit imprecise - the idea is to have instructions in one
    position, the constants they operate on in the other.

    But speaking of Mill - it's been very quiet for quite some time
    now...
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Mar 8 18:07:53 2025
    From Newsgroup: comp.arch

    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    Found that post interesting.

    As outlined, the immediate base register requires a double-wide link
    register. This may be okay for code with 32b addresses running in a
    64-bit machine. But otherwise would probably need to go through another
    GPR to manage the immediate base register. It is potentially more
    instructions in the function prolog / epilog code. And more instructions
    at function call.

    I think splitting the code and constant into separate streams requires
    another port(s) on the I$. The port may already be present if
    jump-through-table, JTT, is supported.

    I guess that the constant tables for a subroutine would be placed either
    before or after a subroutine. I would not use the constant tables for
    all constants. Small constants are better encoded directly in the
    instruction. That means using bits to select between small constants or
    relative addresses.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.


  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Mar 8 23:56:14 2025
    From Newsgroup: comp.arch

    On Sat, 8 Mar 2025 17:53:34 +0000, MitchAlsup1 wrote:

    On Sat, 8 Mar 2025 14:21:51 +0000, Thomas Koenig wrote:

    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I knew a guy with that name at AMD--he did microcode--and did it well.

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I took a quick look, and it seems that
    a) too few registers
    b) too many OpCode bits
    although it does look easy to parse.

    The length decode is wasteful of bits. There are 4 sizes of instructions
    16, 32, 64, 128 denoted by the first halfword having (respectively)
    00, 01, 10, 11. But successive halfwords contain 2 bits that simply
    waste entropy and could have been used for "other good stuff".

    16-bit instructions get a 5-bit opcode, and the entire 32 instruction
    space is already fully populated.

    32-bit instructions get a 10-bit OpCode space. At this point I should
    note that my entire OpCode instruction space has only 62 instructions.

    64-bit instructions get a 20-bit OpCode space. Nobody is going to need
    1M individual instructions.

    So, a bit of rearrangement would provide for a healthy OpCode space
    and more bits for registers, and possibly a 96-bit instruction in-
    stead of a 128-bit instruction.

    So, we are still missing::
    a) a memory order model
    b) a translation model
    c) atomic instructions
    d) external linkage {code and data}
    e) thread support using his {ip, bp} construct
    f) system call model
    g) debug model
    h) timers and counters
    i) floating point
    ..
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 9 07:17:31 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    I think splitting the code and constant into separate streams requires another port(s) on the I$. The port may already be present if jump-through-table, JTT, is supported.

    There is also the problem of additional cache (page, ...) misses with
    the instruction stream. Maybe an extra "constant data" cache?
    That would depend on how far the extra data is from the code.

    But branches are going to be more expensive because it is not
    only the PC that needs to be changed, but also the data pointer.

    Thinking about this a bit more... conceptually, this is not so far
    off from the /360 base pointer addressing mode, but with the base
    pointer implied instead of explicit.

    I guess that the constant tables for a subroutine would be placed either before or after a subroutine.

    Like what was usually done for the /360, I believe.

    But much more "fun" could be had if the base pointer was supplied
    by the caller. Want a routine that does something different,
    just call it with a different constant stream for instructions.
    (OK, you could also pass an argument, but that would offer fewer
    possibilities for quasi self-modifying code).
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 9 03:44:38 2025
    From Newsgroup: comp.arch

    On 3/8/2025 5:07 PM, Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream.  It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    Found that post interesting.

    As outlined, the immediate base register requires a double-wide link
    register. This may be okay for code with 32b addresses running in a
    64-bit machine. But otherwise would probably need to go through another
    GPR to manage the immediate base register. It is potentially more
    instructions in the function prolog / epilog code. And more instructions
    at function call.

    I think splitting the code and constant into separate streams requires
    another port(s) on the I$. The port may already be present if
    jump-through-table, JTT, is supported.


    I found a few of the ideas questionable at best...

    Possibly an IB like use-case could be handled instead by just using it
    as a dedicated base register for constant loads. But, this would have
    similar latency to a traditional constant pool (which also sucks).

    But, if it is directly loaded inline, this could add extra complexity
    and delay to the pipeline.


    It almost seems like a case of "what if we took a constant pool, and
    made it worse...".


    Or, if a constant pool does have a strong enough use-case (say, one
    wants fixed-length 16-bit ops), maybe treat it like a constant pool but
    have a few special case helper ops.

    Say, 16-bit ops:
    MOV.L @IB+, Rn //load and advance 4 bytes
    MOV.Q @IB+, Rn //load and advance 8 bytes
    MOV.L (IB, Disp4n*4), Rn
    MOV.Q (IB, Disp4n*4), Rn
    Where, the displacement is negative to allow repeating a recently seen
    prior value.

    With the usual caveats of supporting auto-increment.
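
    A toy model of the helper ops above, as a sketch only. The class and
    method names are hypothetical, little-endian layout is assumed, and the
    negative displacement is assumed to be 1..16 units of 4 bytes behind
    IB; none of that is specified in the post.

```python
import struct


class ConstStream:
    """Toy model of an immediate-base (IB) pointer walking a constant pool."""

    _FMT = {4: "<I", 8: "<Q"}  # .L = 4 bytes, .Q = 8 bytes

    def __init__(self, pool: bytes):
        self.pool = pool
        self.ib = 0  # byte offset of the next unread constant

    def load_advance(self, size):
        """Model of MOV.L / MOV.Q @IB+, Rn: read `size` bytes, then advance IB."""
        (value,) = struct.unpack_from(self._FMT[size], self.pool, self.ib)
        self.ib += size
        return value

    def load_back(self, disp, size=4):
        """Model of MOV.x (IB, Disp4n*4), Rn: re-read a value behind IB.

        Assumption: disp is 1..16 and addresses IB - disp*4.
        """
        offset = self.ib - disp * 4
        if not 1 <= disp <= 16 or offset < 0:
            raise ValueError("displacement out of range")
        (value,) = struct.unpack_from(self._FMT[size], self.pool, offset)
        return value


pool = struct.pack("<IQ", 0xDEADBEEF, 0x0123456789ABCDEF)
cs = ConstStream(pool)
a = cs.load_advance(4)  # first constant, IB moves to 4
b = cs.load_advance(8)  # second constant, IB moves to 12
c = cs.load_back(3)     # IB - 12 = 0: the first constant again
```

    IB here is architectural state that changes on every load, so it has
    to be saved, restored, and redirected on branches like any other
    auto-increment pointer.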


    I guess that the constant tables for a subroutine would be placed either before or after a subroutine. I would not use the constant tables for
    all constants. Small constants are better encoded directly in the instruction. That means using bits to select between small constants or relative addresses.

    I think it is better to use a constant prefix / postfix instruction to encode larger constants in the instruction stream. Or use a wider instruction format. In Q+ constant postfixes can be used to override a register spec, allowing immediate constants to be used with many more instructions.


    Agree...

    If one is already going to have a variable length encoding, why not make
    it have decent inline immediate fields?...




  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 9 05:01:42 2025
    From Newsgroup: comp.arch

    On 3/8/2025 5:56 PM, MitchAlsup1 wrote:
    On Sat, 8 Mar 2025 17:53:34 +0000, MitchAlsup1 wrote:

    On Sat, 8 Mar 2025 14:21:51 +0000, Thomas Koenig wrote:

    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream.  It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I knew a guy with that name at AMD--he did microcode--and did it well.

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I took a quick look, and it seems that
    a) too few registers
    b) too many OpCode bits
    although it does look easy to parse.

    The length decode is wasteful of bits. There are 4 sizes of instructions
    16, 32, 64, 128 denoted by the first halfword having (respectively)
    00, 01, 10, 11. But successive halfwords contain 2 bits that simply
    waste entropy and could have been used for "other good stuff".


    This is why my "if doing something similar" idea would be to use 1 bit
    per 16-bit word. Similar effect, less waste.
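
    Going only by the figures quoted in this thread (not by the glyph
    specification itself), the marker overhead of the two schemes can be
    tallied as below. The helper name is hypothetical.

```python
def marker_bits(size_bits, bits_per_halfword):
    """Bits spent on length marking when every 16-bit word carries a marker."""
    return (size_bits // 16) * bits_per_halfword


# 2 marker bits per halfword, sizes 16/32/64/128 (as described above).
two_bit = {s: marker_bits(s, 2) for s in (16, 32, 64, 128)}
# 1 marker bit per halfword, sizes 16/32/48/64 (the alternative sketch).
one_bit = {s: marker_bits(s, 1) for s in (16, 32, 48, 64)}
print(two_bit)
print(one_bit)
```

    Either way, four sizes need only 2 bits of information; anything
    beyond that is spent on making each halfword self-describing.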


    16-bit instructions get a 5-bit opcode, and the entire 32 instruction
    space is already fully populated.


    Yeah.

    As can be noted, the design seemed poorly balanced IMO.


    32-bit instructions get a 10-bit OpCode space. At this point I should
    note that my entire OpCode instruction space has only 62 instructions.


    10 bits seems reasonable at least.

    Can note that if one were to have all of RISC-V as 3R ops, there would
    have been 15 bits of opcode...

    Slightly less with 12-bit immediate values, but roughly break-even (in
    terms of entropy cost) with 10 bit immediate values with 6 bit registers.



    Though, they managed to burn through most of it already.

    IMHO, RISC-V was not particularly efficient with their use of encoding
    space. Not so much the core ISA, but more the extensions.



    64-bit instructions get a 20-bit OpCode space. Nobody is going to need
    1M individual instructions.


    Nobody is going to need 12-bit register fields either...


    So, a bit of rearrangement would provide for a healthy OpCode space
    and more bits for registers, and possibly a 96-bit instruction in-
    stead of a 128-bit instruction.


    32/64/96 works well.


    Can note for 16-bit ops, that unless most of the ops are 16-bit, the
    savings are actually fairly modest.

    If 40% of the ops become 16 bit, you save 20%; 60% of the ops saves 30%.

    The question then becomes how much coverage can one get.

    Or, if one saves maybe 10-20% on the size of ".text", if downsides are
    worth it.
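
    The arithmetic behind that estimate, as a small sketch. It assumes the
    instruction count stays the same when ops shrink from 32 to 16 bits,
    which is optimistic since 16-bit ops can do less per instruction. The
    function name is made up.

```python
def text_savings(frac16):
    """Fraction of code size saved when `frac16` of 32-bit ops shrink to 16 bits."""
    return frac16 * (32 - 16) / 32


for f in (0.4, 0.6, 0.8):
    print(f"{f:.0%} of ops at 16 bits -> {text_savings(f):.0%} smaller")
```

    So the saving is always half the coverage, and a 10-20% smaller
    ".text" corresponds to only 20-40% of ops fitting the 16-bit forms.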


    So, we are still missing::
    a) a memory order model
    b) a translation model
    c) atomic instructions
    d) external linkage {code and data}
    e) thread support using his {ip, bp} construct
    f) system call model
    g) debug model
    h) timers and counters
    i) floating point
    ..

    Yeah...

  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 9 08:03:22 2025
    From Newsgroup: comp.arch

    On 2025-03-09 4:44 a.m., BGB wrote:
    On 3/8/2025 5:07 PM, Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream.  It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    Found that post interesting.

    As outlined, the immediate base register requires a double-wide link
    register. This may be okay for code with 32b addresses running in a
    64-bit machine. But otherwise would probably need to go through
    another GPR to manage the immediate base register. It is potentially
    more instructions in the function prolog / epilog code. And more
    instructions at function call.

    I think splitting the code and constant into separate streams requires
    another port(s) on the I$. The port may already be present if
    jump-through-table, JTT, is supported.


    I found a few of the ideas questionable at best...

    Possibly an IB like use-case could be handled instead by just using it
    as a dedicated base register for constant loads. But, this would have similar latency to a traditional constant pool (which also sucks).

    But, if it is directly loaded inline, this could add extra complexity
    and delay to the pipeline.


    It almost seems like a case of "what if we took a constant pool, and
    made it worse...".


    Or, if a constant pool does have a strong enough use-case (say, one
    wants fixed-length 16-bit ops), maybe treat it like a constant pool but
    have a few special case helper ops.

    Say, 16-bit ops:
      MOV.L @IB+, Rn   //load and advance 4 bytes
      MOV.Q @IB+, Rn   //load and advance 8 bytes
      MOV.L (IB, Disp4n*4), Rn
      MOV.Q (IB, Disp4n*4), Rn
    Where, the displacement is negative to allow repeating a recently seen
    prior value.

    With the usual caveats of supporting auto-increment.


    I guess that the constant tables for a subroutine would be placed
    either before or after a subroutine. I would not use the constant
    tables for all constants. Small constants are better encoded directly
    in the instruction. That means using bits to select between small
    constants or relative addresses.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.


    Agree...

    If one is already going to have a variable length encoding, why not make
    it have decent inline immediate fields?...




    One thought I had a while ago using a similar technique to glyph's was
    to place constants at the beginning or the end of a cache line. Then the
    immediate base register is not needed. The relative offsets would be in
    terms of the current cache line. It has a couple of drawbacks though,
    one being the need to branch around the constant data; could be done by
    carefully maintaining the next fetch address. Another drawback is the
    code is repositionable only at cache-line boundaries. Might make
    assembling / linking code interesting.


  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 9 19:23:04 2025
    From Newsgroup: comp.arch

    On Sun, 9 Mar 2025 12:03:22 +0000, Robert Finch wrote:


    One thought I had a while ago using a similar technique to glyph's was
    to place constants at the beginning or the end of a cache line. Then the immediate base register is not needed. The relative offsets would be in
    terms of the current cache line. It has a couple of drawbacks though,
    one being the need to branch around the constant data; could be done by carefully maintaining the next fetch address. Another drawback is the
    code is repositionable only at cache-line boundaries. Might make
    assembling / linking code interesting.

    If you put the constants at the end of the cache line, you will have
    accessed the constants while decoding the instructions and you can
    figure out when to jump to the next cache line without branching.
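
    One way this could work, as a rough sketch under stated assumptions:
    instructions fill a line from the front, constants fill it from the
    back, and each instruction consumes its constants from the end
    downward in program order. The slot layout, the (name, n_consts)
    instruction format, and the mnemonics are all invented for the
    example.

```python
def decode_line(line):
    """Decode one cache line holding instructions up front, constants at the end.

    Each instruction is a (name, n_consts) pair; constants are plain ints.
    Decoding stops when the instruction index meets the constant region,
    so no branch around the constants is needed.
    """
    decoded = []
    boundary = len(line)  # first slot known to hold a constant
    i = 0
    while i < boundary:
        name, n_consts = line[i]
        consts = [line[boundary - 1 - k] for k in range(n_consts)]
        boundary -= n_consts
        decoded.append((name, consts))
        i += 1
    return decoded


# Toy 8-slot line: five instructions, then three constant slots.
line = [("LDI", 1), ("ADD", 0), ("LDI", 2), ("NOP", 0), ("RET", 0), 30, 20, 10]
result = decode_line(line)
print(result)
```

    The catch for the assembler is visible even here: instructions and
    constants must exactly fill each line (hence the NOP), so packing
    becomes a per-line bin-fitting problem.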
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 9 19:38:46 2025
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sun, 9 Mar 2025 12:03:22 +0000, Robert Finch wrote:


    One thought I had a while ago using a similar technique to glyph's was
    to place constants at the beginning or the end of a cache line. Then the
    immediate base register is not needed. The relative offsets would be in
    terms of the current cache line. It has a couple of drawbacks though,
    one being the need to branch around the constant data; could be done by
    carefully maintaining the next fetch address. Another drawback is the
    code is repositionable only at cache-line boundaries. Might make
    assembling / linking code interesting.

    If you put the constants at the end of the cache line, you will have
    accessed the constants while decoding the instructions and you can
    figure out when to jump to the next cache line without branching.

    Did I mention I would not like to write an assembler for that? :-)
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sun Mar 9 15:59:57 2025
    From Newsgroup: comp.arch

    On Sun, 9 Mar 2025 07:17:31 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Robert Finch <robfi680@gmail.com> schrieb:

    I think splitting the code and constant into separate streams requires
    another port(s) on the I$. The port may already be present if
    jump-through-table, JTT, is supported.

    There is also the problem of additional cache (page, ...) misses with
    the instruction stream. Maybe an extra "constant data" cache?
    That would depend on how far the extra data is from the code.

    But branches are going to be more expensive because it is not
    only the PC that needs to be changed, but also the data pointer.

    It looks similar to using a constant pool in virtual machine code,
    except that the access is not random but (more or less) sequential as
    straight line code executes.


    Thinking about this a bit more... conceptually, this is not so far
    off from the /360 base pointer addressing mode, but with the base
    pointer implied instead of explicit.

    I guess that the constant tables for a subroutine would be placed either
    before or after a subroutine.

    Like what was usually done for the /360, I believe.

    But much more "fun" could be had if the base pointer was supplied
    by the caller. Want a routine that does something different,
    just call it with a different constant stream for instructions.
    (OK, you could also pass an argument, but that would offer fewer
    possibilities for quasi self-modifying code).
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Mar 9 17:02:44 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I think it is better to use a constant prefix / postfix instruction to encode larger constants in the instruction stream. Or use a wider instruction format. In Q+ constant postfixes can be used to override a register spec, allowing immediate constants to be used with many more instructions.

    Yes, a kind of prefix instruction that says "here comes an immediate value"
    and loads a 2, 4, or 8 byte immediate with all the sign or zero extend
    options into a special constant register in the Decoder and marks it valid.
    The next instruction just says add immediate "ADDI rd, rs" and it implies
    the constant it just stashed.

    That relieves the consumer opcodes from having to encode all the
    different variable immediate formats.

    It could easily extend to multiple immediate prefix instructions so
    one can have instructions like store immediate STD [rd+imm1], imm2
    by just adding a second constant register to the Decoder.

    The only complication I can see is if the instruction producer-consumer
    pair straddle pages and there is a page fault on the second.
    I wouldn't want to have to save the stashed constant as "thread context"
    so it should roll back to the start of the immediate instruction.
    In which case the faulting RIP is the first instruction and the
    faulting address is someplace in the second.
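
    A minimal sketch of that scheme with a single stashed-constant
    register and the roll-back-to-prefix behaviour on a fault. The "IMM"
    prefix mnemonic, the class, and the action tuples are hypothetical;
    only "ADDI" and the rollback idea come from the text above.

```python
class Decoder:
    """Toy decoder with one stashed-constant register."""

    def __init__(self):
        self.const = None       # stashed immediate; None means not valid
        self.restart_pc = None  # address of the pending prefix, if any

    def step(self, pc, op, operand=None, fault=False):
        """Decode one instruction and return an action tuple."""
        if fault:
            # Drop the stash and restart from the prefix if one is pending,
            # so the constant never has to be saved as thread context.
            restart = self.restart_pc if self.const is not None else pc
            self.const, self.restart_pc = None, None
            return ("fault", restart)
        if op == "IMM":
            self.const, self.restart_pc = operand, pc
            return ("stash", operand)
        if op == "ADDI":
            if self.const is None:
                raise ValueError("ADDI without a preceding immediate prefix")
            imm, self.const, self.restart_pc = self.const, None, None
            return ("add", imm)
        raise ValueError(f"unknown op {op!r}")


d = Decoder()
first = d.step(0x0FFC, "IMM", 0x12345678)   # prefix at the end of a page
second = d.step(0x1000, "ADDI", fault=True)  # consumer faults on the next page
retry1 = d.step(0x0FFC, "IMM", 0x12345678)   # re-executed from the prefix
retry2 = d.step(0x1000, "ADDI")
```

    Here the reported restart address is the prefix (0x0FFC) even though
    the faulting fetch was at 0x1000, matching the "faulting RIP is the
    first instruction" description.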

  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 9 21:19:41 2025
    From Newsgroup: comp.arch

    On Sun, 9 Mar 2025 21:02:44 +0000, EricP wrote:

    Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed
    interesting concept of dealing with large constants in an ISA:
    Splitting a the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.

    Yes, a kind of prefix instruction that says "here comes an immediate value"
    and loads a 2, 4, or 8 byte immediate with all the sign or zero extend
    options into a special constant register in the Decoder and marks it valid.
    The next instruction just says add immediate "ADDI rd, rs" and it implies
    the constant it just stashed.

    That relieves the consumer opcodes from having to encode all the
    different variable immediate formats.

    It could easily extend to multiple immediate prefix instructions so
    one can have instructions like store immediate STD [rd+imm1], imm2
    by just adding a second constant register to the Decoder.

    The only complication I can see is if the instruction producer-consumer
    pair straddle pages and there is a page fault on the second.
    I wouldn't want to have to save the stashed constant as "thread context"
    so it should roll back to the start of the immediate instruction.

    Execute the instruction and the (preceding) constant as a single
    instruction, so any fault leaves IP pointing at the constant.

    In which case the faulting RIP is the first instruction and the
    faulting address is someplace in the second.
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 9 21:21:31 2025
    From Newsgroup: comp.arch

    On Sun, 9 Mar 2025 19:38:46 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sun, 9 Mar 2025 12:03:22 +0000, Robert Finch wrote:


    One thought I had a while ago using a similar technique to glyph's was
    to place constants at the beginning or the end of a cache line. Then the
    immediate base register is not needed. The relative offsets would be in
    terms of the current cache line. It has a couple of drawbacks though,
    one being the need to branch around the constant data; could be done by
    carefully maintaining the next fetch address. Another drawback is the
    code is repositionable only at cache-line boundaries. Might make
    assembling / linking code interesting.

    If you put the constants at the end of the cache line, you will have
    accessed the constants while decoding the instructions and you can
    figure out when to jump to the next cache line without branching.

    Did I mention I would not like to write an assembler for that? :-)

    What, no masochism today??
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 9 16:27:23 2025
    From Newsgroup: comp.arch

    On 3/9/2025 4:19 PM, MitchAlsup1 wrote:
    On Sun, 9 Mar 2025 21:02:44 +0000, EricP wrote:

    Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed an
    interesting concept of dealing with large constants in an ISA:
    Splitting the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.

    Yes, a kind of prefix instruction that says "here comes an immediate
    value" and loads a 2, 4, or 8 byte immediate with all the sign or zero
    extend options into a special constant register in the Decoder and
    marks it valid.
    The next instruction just says add immediate "ADDI rd, rs" and it
    implies the constant it just stashed.

    That relieves the consumer opcodes from having to encode all the
    different variable immediate formats.

    It could easily extend to multiple immediate prefix instructions so
    one can have instructions like store immediate STD [rd+imm1], imm2
    by just adding a second constant register to the Decoder.

    The only complication I can see is if the instruction producer-consumer
    pair straddles pages and there is a page fault on the second.
    I wouldn't want to have to save the stashed constant as "thread context"
    so it should roll back to the start of the immediate instruction.

    Execute the instruction and the (preceding) constant as a single
    instruction, so any fault leaves IP pointing at the constant.


    Yeah, if your prefix is executed separately from the instruction it
    modifies, this adds both performance drawbacks and potential for issues
    related to exposing architectural state.

    For all concerned parties, the prefix and modified instruction should
    behave like a single larger instruction.


    In which case the faulting RIP is the first instruction and the
    faulting address is someplace in the second.

  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Mar 9 17:37:25 2025
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Sun, 9 Mar 2025 21:02:44 +0000, EricP wrote:

    Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed an
    interesting concept of dealing with large constants in an ISA:
    Splitting the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.

    Yes, a kind of prefix instruction that says "here comes an immediate
    value" and loads a 2, 4, or 8 byte immediate with all the sign or zero
    extend options into a special constant register in the Decoder and
    marks it valid.
    The next instruction just says add immediate "ADDI rd, rs" and it
    implies the constant it just stashed.

    That relieves the consumer opcodes from having to encode all the
    different variable immediate formats.

    It could easily extend to multiple immediate prefix instructions so
    one can have instructions like store immediate STD [rd+imm1], imm2
    by just adding a second constant register to the Decoder.

    The only complication I can see is if the instruction producer-consumer
    pair straddles pages and there is a page fault on the second.
    I wouldn't want to have to save the stashed constant as "thread context"
    so it should roll back to the start of the immediate instruction.

    Execute the instruction and the (preceding) constant as a single
    instruction, so any fault leaves IP pointing at the constant.

    Yes, that was implied.
    Decode doesn't spit out a uOp for the producer immediate instruction(s)
    and does for the consumer but with the total length for all.
    Retire adds the total to the committed RIP once for the whole sequence.
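
A minimal C sketch of the decode and retire behaviour described above, under an invented encoding. The opcode values `OP_IMM32` and `OP_ADDI`, the `Uop` struct, and the functions `decode_fused` and `retire` are hypothetical names made up for illustration, not taken from any real ISA. The prefix produces no uop of its own; its bytes are folded into the consumer's length, so a fetch failure partway through leaves the PC at the prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding, invented for this sketch only:
 *   OP_IMM32: 1 opcode byte + 4-byte little-endian constant (the prefix)
 *   OP_ADDI : 1 opcode byte + 1 byte holding rd (high nibble), rs (low nibble) */
enum { OP_IMM32 = 0xF0, OP_ADDI = 0x10 };

typedef struct {
    uint8_t  opcode;   /* consumer opcode */
    uint8_t  rd, rs;
    int64_t  imm;      /* constant stashed by the prefix, sign extended */
    int      has_imm;  /* the decoder's "valid" bit for the constant */
    uint32_t length;   /* prefix bytes + consumer bytes */
} Uop;

/* Decode one fused instruction starting at code[pc].
 * Returns 1 and fills *u on success. Returns 0 if bytes are missing
 * (a stand-in for a fetch fault); the caller's pc is unchanged, so a
 * restart begins at the prefix and no stashed constant needs saving. */
int decode_fused(const uint8_t *code, size_t size, size_t pc, Uop *u)
{
    size_t p = pc;
    memset(u, 0, sizeof *u);
    if (pc > size) return 0;

    if (size - p >= 1 && code[p] == OP_IMM32) {
        if (size - p < 5) return 0;          /* constant bytes not available */
        uint32_t raw = (uint32_t)code[p + 1]
                     | (uint32_t)code[p + 2] << 8
                     | (uint32_t)code[p + 3] << 16
                     | (uint32_t)code[p + 4] << 24;
        /* Portable 32 -> 64 bit sign extension. */
        u->imm = (raw & 0x80000000u) ? (int64_t)raw - 4294967296LL : (int64_t)raw;
        u->has_imm = 1;
        p += 5;                              /* no uop emitted for the prefix */
    }

    if (size - p < 2) return 0;              /* consumer not fully available */
    u->opcode = code[p];
    u->rd = code[p + 1] >> 4;
    u->rs = code[p + 1] & 0x0F;
    u->length = (uint32_t)(p + 2 - pc);      /* total length of the sequence */
    return 1;
}

/* Retire: the committed PC advances once, by the fused length. */
size_t retire(size_t committed_pc, const Uop *u)
{
    return committed_pc + u->length;
}
```

With a 5-byte prefix followed by a 2-byte `ADDI`, `decode_fused` reports a single uop of length 7, and `retire` advances the committed PC by 7 in one step.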

    In which case the faulting RIP is the first instruction and the
    faulting address is someplace in the second.

  • From Marcus@m.delete@this.bitsnbites.eu to comp.arch on Sat Mar 22 11:55:57 2025
    From Newsgroup: comp.arch

    Den 2025-03-09 kl. 22:19, skrev MitchAlsup1:
    On Sun, 9 Mar 2025 21:02:44 +0000, EricP wrote:

    Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed an
    interesting concept of dealing with large constants in an ISA:
    Splitting the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.

    Yes, a kind of prefix instruction that says "here comes an immediate
    value" and loads a 2, 4, or 8 byte immediate with all the sign or zero
    extend options into a special constant register in the Decoder and
    marks it valid.
    The next instruction just says add immediate "ADDI rd, rs" and it
    implies the constant it just stashed.

    That relieves the consumer opcodes from having to encode all the
    different variable immediate formats.

    It could easily extend to multiple immediate prefix instructions so
    one can have instructions like store immediate STD [rd+imm1], imm2
    by just adding a second constant register to the Decoder.

    The only complication I can see is if the instruction producer-consumer
    pair straddles pages and there is a page fault on the second.
    I wouldn't want to have to save the stashed constant as "thread context"
    so it should roll back to the start of the immediate instruction.

    Execute the instruction and the (preceding) constant as a single
    instruction, so any fault leaves IP pointing at the constant.

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?



    In which case the faulting RIP is the first instruction and the
    faulting address is someplace in the second.

  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Mar 22 15:04:31 2025
    From Newsgroup: comp.arch

    Marcus <m.delete@this.bitsnbites.eu> schrieb:

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?

    Power 10 chose to do so; actually, larger instructions cannot
    cross a (likely) Cache line size there. According to the Power
    ISA Version 3.1, section 1.6:

    "Prefixed instructions do not cross 64-byte instruction address
    boundaries. When a prefixed instruction crosses a 64-byte boundary,
    the system alignment error handler is invoked."
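
As a rough illustration of the rule quoted above, here is a small C sketch of the check an assembler might make before emitting a prefixed instruction. It assumes prefixed instructions are 8 bytes long and word (4-byte) aligned; the names `crosses_boundary` and `nop_padding` are hypothetical helpers, not part of any real toolchain.

```c
#include <assert.h>
#include <stdint.h>

enum { WORD = 4, PREFIXED_LEN = 8, BOUNDARY = 64 };

/* True if an 8-byte prefixed instruction placed at addr would span a
 * 64-byte instruction address boundary. */
int crosses_boundary(uint64_t addr)
{
    return (addr / BOUNDARY) != ((addr + PREFIXED_LEN - 1) / BOUNDARY);
}

/* Bytes of NOP padding to emit before the prefixed instruction so that
 * it starts on the far side of the boundary. With word-aligned code
 * this is either 0 or one 4-byte NOP. */
uint64_t nop_padding(uint64_t addr)
{
    assert(addr % WORD == 0);   /* sketch assumes word-aligned code */
    return crosses_boundary(addr) ? WORD : 0;
}
```

Under these assumptions the only problematic placement is at offset 60 within a 64-byte block, which is why a single NOP is enough.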
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Mar 22 23:06:34 2025
    From Newsgroup: comp.arch

    On 2025-03-22 11:04 a.m., Thomas Koenig wrote:
    Marcus <m.delete@this.bitsnbites.eu> schrieb:

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?

    Power 10 chose to do so; actually, larger instructions cannot
    cross a (likely) Cache line size there. According to the Power
    ISA Version 3.1, section 1.6:

    "Prefixed instructions do not cross 64-byte instruction address
    boundaries. When a prefixed instruction crosses a 64-byte boundary,
    the system alignment error handler is invoked."

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    I have the assembler moving the code that overlaps to the next cache line.

    It is confusing to look at listing files, as there are constants output
    inline with the code. It makes it look like the code should not work.
    "How does it know where to go for the next instruction?" is the question
    that comes to mind.

    For now, the hardware decoder takes the cheesy approach of marking
    instructions fetched in the constant area as invalid. The constant area
    gets fetched and loaded into the pipeline, but as NOPs.

    It is quite a trick getting the assembler to place constants at the end
    of the cache line and generate references to the constants. It is
    interesting because I have *constants* being relocated by the assembler
    / linker. Normally there would not be a relocation associated with a
    constant. A relocation reference to the constant is spit out by the
    assembler, and the linker updates the index to the constant in the code.

    It does not quite work yet. Constants are placed and code is moved, but
    the linked program does not have the correct references yet.

    Experimental, but looking like things will work.
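
The placement problem can be sketched in C roughly as follows. This is only an illustration of the idea, not the LB650 assembler: it assumes a 64-byte cache line, uses the 32-bit instructions and 16-bit constant packets mentioned elsewhere in the thread, and the names `Line`, `line_reset` and `line_place` are made up. Instructions grow from the front of the line, constants grow down from the end, and an instruction that would make the two regions collide has to move to the next line.

```c
#include <assert.h>

enum { LINE_BYTES = 64, INSN_BYTES = 4, CONST_BYTES = 2 };

typedef struct {
    int code_end;     /* offset one past the last instruction placed */
    int const_start;  /* offset of the lowest constant packet placed */
} Line;

void line_reset(Line *ln)
{
    ln->code_end = 0;
    ln->const_start = LINE_BYTES;
}

/* Try to place one instruction that needs n_packets 16-bit constant
 * packets. On success, returns the offset where the constant area now
 * starts (the instruction's packets begin there). Returns -1 if code
 * and constants would overlap; the caller then starts a new line. */
int line_place(Line *ln, int n_packets)
{
    assert(n_packets >= 0);
    int new_code_end = ln->code_end + INSN_BYTES;
    int new_const_start = ln->const_start - n_packets * CONST_BYTES;

    if (new_code_end > new_const_start)
        return -1;                 /* code would run into the constants */

    ln->code_end = new_code_end;
    ln->const_start = new_const_start;
    return new_const_start;
}
```

The offset returned is the kind of value a relocation against the constant would have to carry, since it is only known once the line has been laid out.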

  • From Marcus@m.delete@this.bitsnbites.eu to comp.arch on Sun Mar 23 13:12:25 2025
    From Newsgroup: comp.arch

    Den 2025-03-23 kl. 04:06, skrev Robert Finch:
    On 2025-03-22 11:04 a.m., Thomas Koenig wrote:
    Marcus <m.delete@this.bitsnbites.eu> schrieb:

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?

    Power 10 chose to do so; actually, larger instructions cannot
    cross a (likely) Cache line size there.  According to the Power
    ISA Version 3.1, section 1.6:

    "Prefixed instructions do not cross 64-byte instruction address
    boundaries. When a prefixed instruction crosses a 64-byte boundary,
    the system alignment error handler is invoked."

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    I have the assembler moving the code that overlaps to the next cache line.

    It is confusing to look at listing files, as there are constants output
    inline with the code. It makes it look like the code should not work.
    "How does it know where to go for the next instruction?" is the question
    that comes to mind.

    For now, the hardware decoder takes the cheesy approach of marking
    instructions fetched in the constant area as invalid. The constant area
    gets fetched and loaded into the pipeline, but as NOPs.

    It is quite a trick getting the assembler to place constants at the end
    of the cache line and generate references to the constants. It is
    interesting because I have *constants* being relocated by the assembler
    / linker. Normally there would not be a relocation associated with a
    constant. A relocation reference to the constant is spit out by the
    assembler, and the linker updates the index to the constant in the code.

    It does not quite work yet. Constants are placed and code is moved, but
    the linked program does not have the correct references yet.

    Experimental, but looking like things will work.


    Although I have not tried any of these techniques, here are my thoughts.

    Why not always place the constant next to (right after) the instruction
    that references it, instead of at an offset within the cache line?

    The effect should be very similar, but now you have a simpler offset
    (it's always zero) and you eliminate the problem with having to keep
    track of where the constants are in order to prevent the PC/IP from
    running into the constant area.

  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 23 13:44:23 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    What is your motivation for this?

    If you have an instruction including constant(s) which no longer
    fits your cache line (say, 8 bytes left and 12 bytes needed)
    it does not matter where you put the constants and where you
    put the instructions - it will not fit, and you have to start
    a new cache line.

    I am not seeing an advantage over what Power 10 does, which is
    just to add a NOP at the end if things don't fit on a cacheline.
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Mar 23 09:44:04 2025
    From Newsgroup: comp.arch

    Marcus wrote:
    Den 2025-03-09 kl. 22:19, skrev MitchAlsup1:
    On Sun, 9 Mar 2025 21:02:44 +0000, EricP wrote:

    Robert Finch wrote:
    On 2025-03-08 9:21 a.m., Thomas Koenig wrote:
    There was a recent post to the gcc mailing list which showed an
    interesting concept of dealing with large constants in an ISA:
    Splitting the instruction and constant stream. It can be found
    at https://github.com/michaeljclark/glyph/ , and is named "glyph".

    I think the problem the author is trying to solve is better addressed by
    My 66000 (and I would absolutely _hate_ to write an assembler for it).
    Still, I thought it worth mentioning.

    I think it is better to use a constant prefix / postfix instruction to
    encode larger constants in the instruction stream. Or use a wider
    instruction format. In Q+ constant postfixes can be used to override a
    register spec, allowing immediate constants to be used with many more
    instructions.

    Yes, a kind of prefix instruction that says "here comes an immediate
    value" and loads a 2, 4, or 8 byte immediate with all the sign or zero
    extend options into a special constant register in the Decoder and
    marks it valid.
    The next instruction just says add immediate "ADDI rd, rs" and it
    implies the constant it just stashed.

    That relieves the consumer opcodes from having to encode all the
    different variable immediate formats.

    It could easily extend to multiple immediate prefix instructions so
    one can have instructions like store immediate STD [rd+imm1], imm2
    by just adding a second constant register to the Decoder.

    The only complication I can see is if the instruction producer-consumer
    pair straddles pages and there is a page fault on the second.
    I wouldn't want to have to save the stashed constant as "thread context"
    so it should roll back to the start of the immediate instruction.

    Execute the instruction and the (preceding) constant as a single
    instruction, so any fault leaves IP pointing at the constant.

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?


    In which case the faulting RIP is the first instruction and the
    faulting address is someplace in the second.

    It's an unnecessary restriction. As I point out above and in another msg,
    the core doesn't advance the committed RIP by the size of the constant
    instruction; it adds that size to the following consumer instruction,
    effectively turning the pair into a variable length instruction.
    And straddle faults can already happen with variable length instructions.
    If it faults on a straddle then the RIP rolls back to point at the
    constant and the faulting address points at the next page.



  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 23 16:51:40 2025
    From Newsgroup: comp.arch

    On Sun, 23 Mar 2025 12:12:25 +0000, Marcus wrote:

    Den 2025-03-23 kl. 04:06, skrev Robert Finch:
    On 2025-03-22 11:04 a.m., Thomas Koenig wrote:
    Marcus <m.delete@this.bitsnbites.eu> schrieb:

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?

    Power 10 chose to do so; actually, larger instructions cannot
    cross a (likely) Cache line size there.  According to the Power
    ISA Version 3.1, section 1.6:

    "Prefixed instructions do not cross 64-byte instruction address
    boundaries. When a prefixed instruction crosses a 64-byte boundary,
    the system alignment error handler is invoked."

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    I have the assembler moving the code that overlaps to the next cache
    line.

    It is confusing to look at listing files, as there are constants output
    inline with the code. It makes it look like the code should not work.
    "How does it know where to go for the next instruction?" is the question
    that comes to mind.

    For now, the hardware decoder takes the cheesy approach of marking
    instructions fetched in the constant area as invalid. The constant area
    gets fetched and loaded into the pipeline, but as NOPs.

    It is quite a trick getting the assembler to place constants at the end
    of the cache line and generate references to the constants. It is
    interesting because I have *constants* being relocated by the assembler
    / linker. Normally there would not be a relocation associated with a
    constant. A relocation reference to the constant is spit out by the
    assembler, and the linker updates the index to the constant in the code.

    It does not quite work yet. Constants are placed and code is moved, but
    the linked program does not have the correct references yet.

    Experimental, but looking like things will work.


    Although I have not tried any of these techniques, here are my thoughts.

    Why not always place the constant next to (right after) the instruction
    that references it, instead of at an offset within the cache line?

    The effect should be very similar, but now you have a simpler offset
    (it's always zero) and you eliminate the problem with having to keep
    track of where the constants are in order to prevent the PC/IP from
    running into the constant area.

    You also don't need to worry about cache line (i.e., artificial)
    boundaries.
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 23 13:54:19 2025
    From Newsgroup: comp.arch

    On 2025-03-23 8:12 a.m., Marcus wrote:
    Den 2025-03-23 kl. 04:06, skrev Robert Finch:
    On 2025-03-22 11:04 a.m., Thomas Koenig wrote:
    Marcus <m.delete@this.bitsnbites.eu> schrieb:

    Then we have the page-crossing issue. Is it better to force the
    compiler/assembler to align such instructions so that they never cross
    page boundaries?

    Power 10 chose to do so; actually, larger instructions cannot
    cross a (likely) Cache line size there.  According to the Power
    ISA Version 3.1, section 1.6:

    "Prefixed instructions do not cross 64-byte instruction address
    boundaries. When a prefixed instruction crosses a 64-byte boundary,
    the system alignment error handler is invoked."

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    I have the assembler moving the code that overlaps to the next cache
    line.

    It is confusing to look at listing files, as there are constants output
    inline with the code. It makes it look like the code should not work.
    "How does it know where to go for the next instruction?" is the question
    that comes to mind.

    For now, the hardware decoder takes the cheesy approach of marking
    instructions fetched in the constant area as invalid. The constant area
    gets fetched and loaded into the pipeline, but as NOPs.

    It is quite a trick getting the assembler to place constants at the
    end of the cache line and generate references to the constants. It is
    interesting because I have *constants* being relocated by the
    assembler / linker. Normally there would not be a relocation
    associated with a constant. A relocation reference to the constant is
    spit out by the assembler, and the linker updates the index to the
    constant in the code.

    It does not quite work yet. Constants are placed and code is moved,
    but the linked program does not have the correct references yet.

    Experimental, but looking like things will work.


    Although I have not tried any of these techniques, here are my thoughts.

    Why not always place the constant next to (right after) the instruction
    that references it, instead of at an offset within the cache line?

    That is a very good idea. It is the same thing almost as using a
    variable length instruction.

    LB650 uses a smaller constant packet (16-bits) than the instruction. So,
    instructions would need to be able to be aligned at 16-bit boundaries.
    LB650 instructions are fixed 32-bit. There is also the possibility of
    sharing the same constants, although slim.

    The effect should be very similar, but now you have a simpler offset
    (it's always zero) and you eliminate the problem with having to keep
    track of where the constants are in order to prevent the PC/IP from
    running into the constant area.

    I wish I had thought of that last night. But I have coded things now.
    Got the compiler / assembler going. The listings are a few percent
    shorter than the PowerPC. It may be due to bugs yet. I think the
    difference may be the PowerPC burns up bits using pairs of instructions
    for high/low halves of the constant.

    The vbcc compiler for the PowerPC was modified.
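
For reference, the high/low split that a two-instruction constant load relies on can be sketched in C as below. This models only the 32-bit arithmetic of a `lis`/`ori` pair and a `lis`/`addi` pair; the function names are hypothetical helpers for illustration and are not taken from vbcc.

```c
#include <assert.h>
#include <stdint.h>

/* Halves for "lis rD,hi ; ori rD,rD,lo": ori zero-extends its
 * immediate, so the constant splits cleanly into two 16-bit halves. */
uint16_t const_hi(uint32_t c) { return (uint16_t)(c >> 16); }
uint16_t const_lo(uint32_t c) { return (uint16_t)(c & 0xFFFFu); }

/* High half for "lis rD,ha ; addi rD,rD,lo": addi sign-extends lo,
 * so the high half is adjusted up by one when bit 15 of lo is set. */
uint16_t const_ha(uint32_t c) { return (uint16_t)((c + 0x8000u) >> 16); }

/* 32-bit model of what the lis/ori pair computes. */
uint32_t build_lis_ori(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* 32-bit model of what the lis/addi pair computes. */
uint32_t build_lis_addi(uint16_t ha, uint16_t lo)
{
    /* Sign-extend lo to 32 bits without relying on signed conversion. */
    uint32_t sext = (lo & 0x8000u) ? (0xFFFF0000u | lo) : lo;
    return ((uint32_t)ha << 16) + sext;   /* wraps modulo 2^32 */
}
```

Either way a 32-bit constant costs two 32-bit instructions (64 bits of code) to deliver 32 bits of payload, which is the overhead an inline or end-of-line constant scheme is trying to avoid.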

  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 23 20:02:59 2025
    From Newsgroup: comp.arch

    On Sun, 23 Mar 2025 17:54:19 +0000, Robert Finch wrote:

    On 2025-03-23 8:12 a.m., Marcus wrote:
    Den 2025-03-23 kl. 04:06, skrev Robert Finch:
    On 2025-03-22 11:04 a.m., Thomas Koenig wrote:
    Marcus <m.delete@this.bitsnbites.eu> schrieb:

    It does not quite work yet. Constants are placed and code is moved,
    but the linked program does not have the correct references yet.

    Experimental, but looking like things will work.


    Although I have not tried any of these techniques, here are my thoughts.

    Why not always place the constant next to (right after) the instruction
    that references it, instead of at an offset within the cache line?

    That is a very good idea. It is the same thing almost as using a
    variable length instruction.

    LB650 uses a smaller constant packet (16-bits) than the instruction. So,
    instructions would need to be able to be aligned at 16-bit boundaries.
    LB650 instructions are fixed 32-bit. There is also the possibility of
    sharing the same constants, although slim.

    Consider stack pointer displacements:: local x is always [SP,offset(X)]
    since these are mostly 16-bit displacements, they are already optimal.

    But consider local-static data:: displacement( x ) varies with IP,
    unless the compiler 'eats' an instruction to load the address--which
    is generally not that great of an idea.

    Then consider extern-global data:: you are likely using 32-bit
    (tiny-medium model) or 64-bit (large and huge model). The 32-bit
    form is arguably IP-relative, the 64-bit form can be argued either
    way.

    My overall argument is that there are unlikely to be "all that many"
    exactly equal constants in the dozen or so instructions that make up
    that cache line.

    The effect should be very similar, but now you have a simpler offset
    (it's always zero) and you eliminate the problem with having to keep
    track of where the constants are in order to prevent the PC/IP from
    running into the constant area.

    I wish I had thought of that last night. But I have coded things now.

    Never let a done (but mediocre) solution prevent better solutions--
    that is for your Boss to decide.

    Got the compiler / assembler going. The listings are a few percent
    shorter than the PowerPC. It may be due to bugs yet. I think the
    difference may be the PowerPC burns up bits using pairs of instructions
    for high/low halves of the constant.

    The vbcc compiler for the PowerPC was modified.
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 23 19:13:25 2025
    From Newsgroup: comp.arch

    On 2025-03-23 9:44 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    What is your motivation for this?

    This was more of a two day experiment. Wanted to see if code density
    could be improved.

    I have gone back to Q+ which maybe has better code density.
    Got a better result after putting more work into the assembler.
    For my serial I/O routines:

    PowerPC: 1624 bytes (compiled with vbcc)
    Q+: 1456 bytes (arpl compiler)

    I am guessing most of the gain is from function prolog / epilog where Q+
    has enter / leave instructions. There are still a couple of issues in the
    arpl compiler; it outputs more instructions than it needs to. Those
    24-bit instructions work well.


    If you have an instruction including constant(s) which no longer
    fits your cache line (say, 8 bytes left and 12 bytes needed)
    it does not matter where you put the constants and where you
    put the instructions - it will not fit, and you have to start
    a new cache line.

    Yes.

    I am not seeing an advantage over what Power 10 does, which is
    just to add a NOP at the end if things don't fit on a cacheline.

    NOP ramps in my parlance. I use them to handle crossing page boundaries.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Mar 24 06:47:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2025-03-23 8:12 a.m., Marcus wrote:
    Why not always place the constant next to (right after) the instruction
    that references it, instead of at an offset within the cache line?

    That is a very good idea. It is the same thing almost as using a
    variable length instruction.

    b16 (both b16-dsp and b16-small) puts 4 5-bit instructions in a 16-bit
    word (for an instruction in slot 0, 4 of the 5 bits are 0, so it can
    only be either a nop or a call).

    The lit and litc instructions take their operands from the next word
    in the instruction stream (it seems to me that litc, which only takes
    one byte from the instruction stream, makes sense only if there are
    two litc's in one instruction word, because the program counter has to
    be 16-bit-aligned on the next instruction fetch).

    Control-flow instructions take their target address from the rest
    of the instruction word (i.e., a call in the first slot has a 15-bit
    target, a jump in slot 2 has a 5-bit target); a control-flow
    instruction in slot 3 (no bits left in the instruction word) takes the
    address from the stack.
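
The slot arithmetic described above can be sketched in C as follows. The bit ordering is an assumption made for this sketch (slot 0 in the top bit, then three 5-bit slots from high to low); the b16 documentation is the authority on the real layout, and `b16_slot`, `b16_target_bits` and `b16_target` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a 16-bit instruction word (1 + 5 + 5 + 5 bits):
 *   bit 15      : slot 0 (only one bit, hence nop or call)
 *   bits 14..10 : slot 1
 *   bits  9..5  : slot 2
 *   bits  4..0  : slot 3 */
unsigned b16_slot(uint16_t word, int slot)
{
    assert(slot >= 0 && slot <= 3);
    if (slot == 0)
        return (unsigned)(word >> 15);
    return (unsigned)(word >> (5 * (3 - slot))) & 0x1Fu;
}

/* Number of bits left in the word after a control-flow instruction in
 * the given slot: 15, 10, 5, or 0 (target then comes from the stack). */
int b16_target_bits(int slot)
{
    assert(slot >= 0 && slot <= 3);
    return 5 * (3 - slot);
}

/* Target field: the remaining low bits of the instruction word. */
unsigned b16_target(uint16_t word, int slot)
{
    return word & ((1u << b16_target_bits(slot)) - 1u);
}
```

This reproduces the sizes given in the post: a call in slot 0 leaves a 15-bit target, a jump in slot 2 leaves 5 bits, and slot 3 leaves none.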

    This is all designed for strictly sequential decoding and processing,
    with no pipelining. Bernd Paysan reports 200MHz for b16-dsp and
    150MHz for b16-small in XC035, a 0.35um process. For comparison, the
    P54C (Pentium) reached up to 200MHz and Klamath (Pentium II) up to
    300MHz in 0.35um, but with pipelining, and also providing lots of
    performance-enhancing features; and they were 32-bit CPUs, while b16
    is 16-bit.

    Various information on the b16 can be found at
    <https://bernd-paysan.de/b16.html>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 24 04:00:14 2025
    From Newsgroup: comp.arch

    On 3/23/2025 8:44 AM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    In the latest test project, the LB650 similar to a PowerPC, large
    constants are encoded at the end of the cache line. So, there is a
    similar issue of code running into the constant area.

    What is your motivation for this?

    If you have an instruction including constant(s) which no longer
    fits your cache line (say, 8 bytes left and 12 bytes needed)
    it does not matter where you put the constants and where you
    put the instructions - it will not fit, and you have to start
    a new cache line.

    I am not seeing an advantage over what Power 10 does, which is
    just to add a NOP at the end if things don't fit on a cacheline.


    I ended up with a vaguely related issue in XG1, where a 96-bit encoding
    at a certain offset would not work within a 128-bit fetch with 64-bit
    alignment.

    The workaround was to insert a 16-bit NOP in the odd case where this
    scenario occurred.

    This issue does not occur with XG2, XG3, or RV+Jx. In XG2 or XG3 modes,
    32-bit alignment is required, at which point it is not possible for a
    96-bit fetch to span 3 QWORDs.
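
    As a rough sketch of the arithmetic (not taken from BGBCC or the
    actual fetch logic): assume the 128-bit fetch window starts at the
    64-bit-aligned address at or below the instruction. The helper name
    fits_in_fetch and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if an instruction of insn_bytes that
 * starts at byte address addr lies entirely inside one fetch window of
 * fetch_bytes, where the window start is addr rounded down to a multiple
 * of fetch_align. This is an assumed model of the fetch, for illustration. */
static bool fits_in_fetch(uint32_t addr, uint32_t insn_bytes,
                          uint32_t fetch_bytes, uint32_t fetch_align)
{
    assert(fetch_align != 0 && insn_bytes <= fetch_bytes);
    uint32_t offset = addr % fetch_align;   /* offset inside the window */
    return offset + insn_bytes <= fetch_bytes;
}
```

    Under that assumption, a 12-byte encoding in a 16-byte window with
    8-byte alignment fits at offsets 0, 2, and 4, but not at offset 6
    (6 + 12 = 18 bytes, which would touch a third QWORD). With 32-bit
    alignment only offsets 0 and 4 can occur, and both fit.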

    At present it doesn't occur in RV+JX both because BGBCC doesn't yet
    support the 'C' extension (in any form that works), and also because
    support for 96-bit encodings was made non-default (*1).

    *1: Thus far, all it can really encode is a 64-bit constant load, and a
    64-bit constant load isn't common enough by itself to justify the added
    issues of dealing with 96-bit cases (instead, 64b constant loads can be
    dealt with by using two 64-bit instructions).



    But, yeah, can note:
    Verilog style bit-manipulation has seen some use in BGBCC, and has the
    merit that in some cases it can generate faster code than traditional C
    style bit manipulation.


    For example, for repacking RGB555 to a 10-bit format for a table lookup:
    v=(_UBitInt(10)) { rgb[14:12], rgb[ 9: 6], rgb[ 4: 2] };
    v=lut[v];

    This turned out to be notably faster (with the BITMOV instruction)
    when compared with:
    cr=(rgb>>12)&7;
    cg=(rgb>> 6)&15;
    cb=(rgb>> 2)&7;
    v=(cr<<7)|(cg<<3)|cb;
    v=lut[v];

    Mostly in relation to RGB555 -> Indexed-color conversion.
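
    For reference, the shift-and-mask version can be wrapped up as a
    self-contained function in plain C. This is only a sketch: the
    function names and the uint8_t table type are assumptions, and it
    shows the portable fallback, not the BITMOV code that BGBCC generates.

```c
#include <stdint.h>

/* Pack RGB555 (rrrrrgggggbbbbb) into a 10-bit RGB343 index by keeping the
 * top 3 bits of red, top 4 bits of green, and top 3 bits of blue.
 * Same bit selection as { rgb[14:12], rgb[9:6], rgb[4:2] }. */
static inline uint16_t rgb555_to_rgb343(uint16_t rgb)
{
    uint16_t cr = (rgb >> 12) & 7;    /* rgb[14:12] */
    uint16_t cg = (rgb >> 6) & 15;    /* rgb[ 9: 6] */
    uint16_t cb = (rgb >> 2) & 7;     /* rgb[ 4: 2] */
    return (uint16_t)((cr << 7) | (cg << 3) | cb);
}

/* Hypothetical wrapper: lut is assumed to hold 1024 palette indices,
 * one per RGB343 value. */
static inline uint8_t rgb555_to_index(const uint8_t lut[1024], uint16_t rgb)
{
    return lut[rgb555_to_rgb343(rgb)];
}
```

    With one byte per entry, the 10-bit index gives a 1K table.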

    Granted, it is still slower than it might have been to have dedicated
    RGB conversion operations, but a lot more generic.


    Though, it is looking like the dedicated palette-conversion instruction
    I added before might in fact be too limiting (since it effectively only
    works with a particular palette), and it may be more reasonable to drop
    it, and switch to a lookup table and a "slightly less niche" instruction
    for repacking RGB555 into a 9 or 10 bit format to feed through a
    palette conversion lookup.

    A 15-bit lookup table is slower due to L1 misses (whereas, 512B or 1K
    has an easier time staying in the L1 cache). I had also noted that
    RGB343 seemingly has a better accuracy at indexed color lookup than
    RGB333 while cheaper than RGB444 (4K lookup).

    The most likely option at the moment is an instruction to repack, say:
    rrrrrgggggbbbbb
    Into, say:
    grbgrbgrbgrbgrb
    Which could at least have multiple use-cases (more so if an instruction
    exists to also switch it back into the usual RGB555 ordering).
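
    A software sketch of what such a repack might compute, assuming
    both patterns are written MSB first and each "grb" group takes the
    bits of equal significance from the three channels (so the top
    group is g4,r4,b4). The function names are made up, and this is a
    plain-C model, not an existing instruction.

```c
#include <stdint.h>

/* Interleave rrrrrgggggbbbbb into grbgrbgrbgrbgrb (assumed bit order:
 * bit 3*i+2 = g[i], bit 3*i+1 = r[i], bit 3*i = b[i]). */
static uint16_t rgb555_interleave(uint16_t rgb)
{
    uint16_t out = 0;
    for (int i = 0; i < 5; i++) {
        uint16_t r = (rgb >> (10 + i)) & 1;
        uint16_t g = (rgb >> (5 + i)) & 1;
        uint16_t b = (rgb >> i) & 1;
        out |= (uint16_t)(((g << 2) | (r << 1) | b) << (3 * i));
    }
    return out;
}

/* Inverse: switch the interleaved form back to the usual RGB555 order. */
static uint16_t rgb555_deinterleave(uint16_t v)
{
    uint16_t out = 0;
    for (int i = 0; i < 5; i++) {
        uint16_t g = (v >> (3 * i + 2)) & 1;
        uint16_t r = (v >> (3 * i + 1)) & 1;
        uint16_t b = (v >> (3 * i)) & 1;
        out |= (uint16_t)((r << (10 + i)) | (g << (5 + i)) | (b << i));
    }
    return out;
}
```

    One possible use under this ordering: shifting the interleaved value
    right by 6 leaves the top 3 bits of each channel (a 9-bit index), and
    shifting right by 3 leaves the top 4 bits of each (a 12-bit index),
    though the index bits are then in interleaved rather than RGB order.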



    Decided to leave off going into specifics of possible considered tweaks
    to the design of my 256-color system palette, and stuff about
    color-cells (and the possibility of adding a 1.25 bpp color cell mode).

    Can note though that a color-cell mode with 8x8x1 cells and 2x 8bpp
    endpoints isn't great for image fidelity (well, and the difficulties of
    trying to make the color-cell encoder fast enough that screen refresh
    can happen at a reasonable framerate).

    ...


    --- Synchronet 3.20c-Linux NewsLink 1.2