• "Mini" tags to reduce the number of op codes

    From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Apr 3 09:43:44 2024
    From Newsgroup: comp.arch

    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but perform
    the appropriate conversion on the second operand first, potentially
    saving an instruction.
    3. Always do the operation in floating point and convert the integer
    operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

    I suspect option 4 is the least useful choice. I am not sure which of
    the others is best.
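
    To make the dispatch rule concrete, here is a minimal C sketch of a
    decoder/executor resolving a shared ADD opcode from the tags, using
    option 3 for the mixed case. The tag array, register file, and helper
    names are hypothetical, purely for illustration; the proposal itself
    does not specify any of this:

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        /* Hypothetical model: one tag bit per architectural register,
           true = floating point, false = integer/address/etc. */
        static bool     reg_tag[32];
        static uint64_t reg_file[32];

        /* Read a register as a double: reinterpret if tagged FP,
           convert if tagged integer (option 3). */
        static double as_fp(int r) {
            double d;
            if (reg_tag[r]) { memcpy(&d, &reg_file[r], 8); return d; }
            return (double)(int64_t)reg_file[r];
        }

        /* One shared ADD opcode, resolved by the source tags. */
        void exec_add(int rd, int rs1, int rs2) {
            if (!reg_tag[rs1] && !reg_tag[rs2]) {     /* both integer */
                reg_file[rd] = reg_file[rs1] + reg_file[rs2];
                reg_tag[rd]  = false;
            } else {                                  /* at least one FP */
                double r = as_fp(rs1) + as_fp(rs2);
                memcpy(&reg_file[rd], &r, 8);
                reg_tag[rd] = true;
            }
        }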

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations. We can thus share eight op codes: the
    four arithmetic operations, max, min, abs, and compare. Less the two
    new load instructions, that is a net savings of six op codes.

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost). The same
    mechanism works for interrupts that take control away from a running
    process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty rare.
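
    Continuing the sketch above, the decoder could special-case those OR
    forms. The zero-source test below assumes a register that reads as zero
    (or an immediate-zero encoding), which is an assumption of this sketch:

        /* OR rd,rs,rs  -> set the FP tag (value unchanged);
           OR rd,rs,r0  -> clear the tag (r0 assumed to read as zero);
           any other OR -> ordinary bitwise OR, tag follows source 1. */
        void exec_or(int rd, int rs1, int rs2) {
            reg_file[rd] = reg_file[rs1] | reg_file[rs2];
            if (rs1 == rs2)    reg_tag[rd] = true;
            else if (rs2 == 0) reg_tag[rd] = false;
            else               reg_tag[rd] = reg_tag[rs1];
        }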

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Apr 3 17:24:05 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the
    disassembler). This reminds me of the work on SafeTSA [amme+01] where
    they encode only programs that are correct (according to some notion
    of correctness).

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost).

    That's expensive in an OoO CPU. There you want each tag to be stored
    alongside the other 64 bits of the register, because they should
    be renamed at the same time. So the ENTER instruction would depend on
    all the registers that it saves (or maybe on all registers). And upon
    EXIT the restored registers have to be reassembled (which is not that
    expensive).

    I have a similar problem for the carry and overflow bits in <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    The same
    mechanism works for interrupts that take control away from a running
    process.

    For context switches one cannot get around the problem, but they are
    much rarer than calls and returns, so requiring a pipeline drain for
    them is not so bad.

    Concerning interrupts, as long as nesting is limited, one could just
    treat the physical registers of the interrupted program as taken, and
    execute the interrupt with the remaining physical registers. No need
    to save any architectural registers or their tag, carry, or overflow
    bits.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use?

    In an OoO CPU, that's pretty heavy.

    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT), so you
    probably can handle the tags in the front end (decoder and renamer).
    Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.
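
    To illustrate the point, a small C sketch of decode-time tag tracking
    under the proposal above; the opcode names are hypothetical. Every
    instruction's effect on the tags is a pure function of its opcode and
    register fields, so the decoder can keep its own copy of the 32 tag
    bits, with EXIT as the one case that must wait on a load:

        #include <stdbool.h>

        enum opcode { LDD, LDF, OP_SHARED, ENTER, EXIT };

        static bool tags[32];   /* decoder's private copy of the tag bits */

        void decode_update_tags(enum opcode op, int rd, int rs1, int rs2) {
            switch (op) {
            case LDF:       tags[rd] = true;  break; /* FP load sets tag */
            case LDD:       tags[rd] = false; break; /* int load clears it */
            case OP_SHARED: tags[rd] = tags[rs1] || tags[rs2]; break;
            case ENTER:     break; /* tags only written out to memory */
            case EXIT:      break; /* tags reload from memory: must stall */
            }
        }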

    @InProceedings{amme+01,
      author =   {Wolfram Amme and Niall Dalton and Jeffery von Ronne
                  and Michael Franz},
      title =    {Safe{TSA}: A Type Safe and Referentially Secure
                  Mobile-Code Representation Based on Static Single
                  Assignment Form},
      crossref = {sigplan01},
      pages =    {137--147},
      annote =   {The basic ideas in this representation are:
                  variables are named as the pair (distance in the
                  dominator tree, assignment within basic block);
                  variables are separated by type, with operations
                  referring only to variables of the right type (like
                  integer and FP instructions and registers in
                  assemblers); memory references use types to encode
                  that a null-pointer check and/or a range check has
                  already occurred, allowing optimizing these
                  operations; the resulting code is encoded (using
                  text compression methods) in a way that supports
                  only correct code. These ideas are discussed mostly
                  in a general way, with some Java-specifics, but the
                  representation supposedly also supports Fortran95
                  and Ada95. The representation supports some CSE, but
                  not for address computation operations. The paper
                  also gives numbers on size (usually a little smaller
                  than Java bytecode), and some other static metrics,
                  especially wrt. the effect of optimizations.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Apr 3 14:44:27 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    If you are adding a float/int data type flag, you might as well
    also add an operand size for floats at least, though some ISAs
    have both int32 and int64 ALU operations for result compatibility.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction
    3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice. I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations. So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare. So far, a net savings of six opcodes.

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000 architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty
    rare.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    Currently the opcode data type can tell the uArch how to route
    the operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Apr 3 20:02:25 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions. Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    What is _really_ eating up opcode space are many- (usually 16-) bit
    constants in the instructions.
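
    Worked out for a 32-bit instruction word: 6 (major opcode) + 3 x 5
    (register specifiers) = 21 bits, leaving 11 bits, i.e. room for
    2^11 = 2048 distinct three-register operations under one major opcode.
    By contrast, a two-register format with a 16-bit constant uses
    6 + 2 x 5 + 16 = 32 bits, consuming an entire major opcode by itself.
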
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 3 15:25:01 2024
    From Newsgroup: comp.arch

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes.  One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible available for future use.  Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it has several features that are “friendly” to the idea.  Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If set, the bit indicates that the corresponding register contains a floating-point value.  Clear indicates not floating point (integer, address, etc.).  There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register.  Non-floating-point loads would clear the
    tag bit.  As I show below, I don’t think you need any special "store
    tag" instructions.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction
    3.    Always do the operation in floating point and convert the integer operand prior to the operation.  (Or, if you prefer, change floating
    point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions.  And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts.  Given the tag bit, these could share the same op-code.  There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with separate compilations?  The called function probably doesn’t know the
    tag value for callee saved registers.  Fortunately, the My 66000 architecture comes to the rescue here.  You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost).  The same mechanism works for interrupts that take control away from a running process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code.   For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it.  These should be pretty rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it?  To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use?  If it causes an
    extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
    think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.



    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of the same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...
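
    A hypothetical C model of that layout and of the "route to ALU/FPU"
    idea, with the tag held beside a 64-bit payload (the struct and names
    are mine, purely illustrative):

        #include <stdint.h>

        enum major { T_OBJREF = 0, T_FIXNUM = 1, T_FLONUM = 2, T_OTHER = 3 };

        typedef struct {
            uint64_t bits;  /* payload */
            uint8_t  tag;   /* bits 3..2: major type, bits 1..0: subtype */
        } reg_t;

        enum fu { FU_ALU, FU_FPU, FU_TRAP };

        /* Route a two-operand op by operand tags; anything mixed or
           non-numeric goes to a software handler. */
        static enum fu route(reg_t a, reg_t b) {
            enum major ma = (enum major)(a.tag >> 2);
            enum major mb = (enum major)(b.tag >> 2);
            if (ma == T_FIXNUM && mb == T_FIXNUM) return FU_ALU;
            if (ma == T_FLONUM && mb == T_FLONUM) return FU_FPU;
            return FU_TRAP;
        }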

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied up
    with the comparably expensive register-forwarding logic), but asking for
    3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues. Or, possibly, the FUs decide
    whether to accept the operation:
    ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

    But, a proper dynamic language allows mixing fixnum and flonum with the
    result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the
    Fixnum to Flonum conversion, one for the FADD itself).

    Many other cases could get hairy, but to have any real benefit, the CPU
    would need to be able to deal with them. In cases where the compiler
    deals with everything, the type-tags become mostly moot (or potentially detrimental).


    But, then, there is another issue:
    C code expects C type semantics to be respected, say:
    Signed int overflow wraps at 32 bits (sign extending);
    Unsigned int overflow wraps at 32 bits (zero extending);
    Variables may not hold values out-of-range for that type;
    The 'long long' and 'unsigned long long' types are exactly 64-bit;
    ...
    ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.
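
    The range mismatch is easy to see with concrete numbers; a small C
    example, assuming a 2-bit tag leaves a 62-bit signed payload:

        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            /* 62-bit signed fixnum range if 2 of 64 bits hold the tag: */
            int64_t fix_max = ((int64_t)1 << 61) - 1;  /*  2^61 - 1 */
            int64_t fix_min = -((int64_t)1 << 61);     /* -2^61     */
            /* 'long long' must cover at least +/- (2^63 - 1), so values
               like INT64_MAX cannot be represented as a tagged fixnum. */
            printf("fixnum range: [%lld, %lld]\n",
                   (long long)fix_min, (long long)fix_max);
            printf("int64 max:    %lld\n", (long long)INT64_MAX);
            return 0;
        }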

    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

    And, if one type-tags statically-typed variables, there is no real
    "value-added" here (and saving a little encoding space at the cost of
    making the rest of the CPU more complicated and expensive isn't much of
    a win).

    Better as I see it, to leave the CPU itself mostly working with raw
    untagged values.


    It can make sense to have helper-ops for type-tags; these don't save
    any encoding space, but rather make the cases that deal with
    type-tagged data a little faster.

    Say:
    Sign-extending a fixnum to 64 bits;
    Setting the tag bits for a fixnum;
    Doing the twiddling to convert between Flonum and Double;
    Setting the tag for various bit patterns;
    Checking the tag(s) against various bit patterns;
    ...

    Where, on a more traditional ISA, the logic to do the bit-twiddling
    for type-checking and tag modification is a significant part of the
    runtime cost of a dynamically typed language.
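
    For a sense of what that bit-twiddling looks like in software, a
    sketch using a hypothetical layout with the tag in the top 2 bits
    (the constants and helper names are mine, not from any real ABI):

        #include <stdint.h>

        #define TAG_SHIFT  62
        #define TAG_FIXNUM ((uint64_t)1)
        #define TAG_MASK   ((uint64_t)3 << TAG_SHIFT)

        /* Each helper is a shift/mask/compare sequence that tagged
           helper-ops could collapse into a single instruction. */
        static int is_fixnum(uint64_t v) {
            return (v >> TAG_SHIFT) == TAG_FIXNUM;
        }
        static int64_t fixnum_value(uint64_t v) {
            return ((int64_t)(v << 2)) >> 2;  /* sign-extend 62 -> 64 bits */
        }
        static uint64_t make_fixnum(int64_t x) {
            return ((uint64_t)x & ~TAG_MASK) | (TAG_FIXNUM << TAG_SHIFT);
        }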

    With luck, one can have dynamic typing that isn't horribly slow.
    But, one still isn't likely to see serious use of dynamic typing in systems-level programming (if anything, Haskell style type-systems seem
    to be more in fashion in this space at present, where trying to get the
    code to be accepted by the compiler is itself an exercise in pain).

    Well, and those of us who would prefer a more ActionScript- or Haxe-like
    approach in a systems-level language (at least as an option for when it
    is useful to do so) are likely kind of the minority.

    Well, and having a C dialect where one can be like, say:
    __variant obj;
    obj = __var { .x=3, .y=4 }; //ex-nihilo object
    obj.z=5; //implicitly creates a new 'z' field in the object.
    Is, not exactly standard...

    And, I end up limiting its use, as any code which touches this stuff can
    only be compiled in BGBCC (for example, getting the TestKern core to
    build in GCC ended up needing to disable things like the BASIC
    interpreter and similar, as I had used some of these features to
    implement the interpreter).


    Though, personally I still prefer being a little more strict than JS/ES
    in some areas: any of ""==null or 0==null or 0=="" or null=="null"
    or similar being true is a misfeature as far as I am concerned (my
    original scripting language had quietly dropped this behavior, despite
    otherwise resembling JS/ES).


    Though, in my case, the ISA is not tagged, so all of this stuff is built
    on top of implicit runtime calls. There is not currently a garbage
    collector, but adding stuff to support precise GC could be possible in
    theory (and in my current project would be a tempting alternative to the
    use of conservative GC, though likely neither could be used in a
    real-time context).

    In one of my own languages, I had instead defined rules to try to allow
    the compiler to figure out object lifetimes in various cases (how an
    object is created implicitly also gave some information about its
    semantics and lifetime).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 3 21:30:02 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes.  One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use.  Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register
    specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
    66000
    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If
    set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register.  Non-floating-point loads would clear the
    tag bit.  As I show below, I don’t think you need any special "store
    tag" instructions.

    What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa?

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

    3.    Always do the operation in floating point and convert the integer
    operand prior to the operation.  (Or, if you prefer, change floating
    point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions.  And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts.  Given the tag bit, these could share the same
    op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if FP
    and Integers share a register file, the lack of a prototype is much less
    stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost).  The same
    mechanism works for interrupts that take control away from a running
    process.

    Yes, but we do just fine without the tag and without the stuff mentioned
    above. Neither ENTER nor EXIT cares about the 64-bit pattern in the
    register.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other
    instructions to do this, without requiring another op-code.   For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it.  These should be pretty
    rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it? 

    No.

    To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use? 

    The problem is that you have made decode dependent on dynamic pipeline
    information. I suggest you don't want to do that. Consider a change from
    int to FP instruction as a predicated instruction: the pipeline cannot
    DECODE the instruction at hand until the predicate resolves. Yech.

    If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
    think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).
    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    It is actually an interesting idea if you want to limit your architecture
    to 1-wide.



    FWIW:
    This doesn't seem too far off from what would be involved with dynamic typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    More importantly, you added a cycle AFTER register READ/Forward before
    you can start executing (more when OoO is in use).

    And finally, the compiler KNOWS what the type is at compile time.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 3 21:53:26 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:


    FWIW:
    This doesn't seem too far off from what would be involved with dynamic typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...

    So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
    This solves nobody's problems; not even LISP.

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    Not good. But what if you don't know the tag until the register is
    delivered from a latent FU? Do you stall DECODE, or do you launch and
    make the instruction queue element deal with all outcomes?

    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied up
    with the comparably expensive register-forwarding logic), but asking for
    3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues.

    Real-friggen-ely

    Or, possible, the FU's decide
    whether to accept the operation:
    ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

    What if IMUL is performed in FMAC, IDIV in FDIV, ... Int<->FP routing is
    based on calculation capability. (Even the CDC 6600 performed int × in
    the FP × unit; not in Thornton's book, but via conversation with a 6600
    logic designer at Asilomar some time ago. All they had to do to get FP ×
    to perform int × was disable 1 gate.)

    But, a proper dynamic language allows mixing fixnum and flonum with the result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    Many other cases could get hairy, but to have any real benefit, the CPU would need to be able to deal with them. In cases where the compiler
    deals with everything, the type-tags become mostly moot (or potentially detrimental).

    You are arguing that the added complexity would somehow pay for itself.
    I can't see it paying for itself.

    But, then, there is another issue:
    C code expects C type semantics to be respected, say:
    Signed int overflow wraps at 32 bits (sign extending);
    maybe
    Unsigned int overflow wraps at 32 bits (zero extending);
    maybe
    Variables may not hold values out-of-range for that type;
    LLVM does this GCC does not.
    The 'long long' and 'unsigned long long' types are exactly 64-bit;
    At least 64-bit not exactly.
    ...
    ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.

    Ya think ?

    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

    I give up--not worth trying to teach cosmologists why the color of the
    lipstick going on the pig is not the problem.....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 3 23:20:46 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB-Alt wrote:



    But, a proper dynamic language allows mixing fixnum and flonum with the
    result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the
    Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    The Burroughs B3500 would simply ignore the zone digit when adding
    a string to an integer, based on the address controller for the
    operand.

    ADD 1225 010000(UN) 020000(UA) 030000(UN)

    Would add the 12 unsigned numeric nibbles at address 10000
    to the 25 numeric digits of the 8-bit EBCDIC/ASCII data at address 20000
    and store the result as 25 numeric nibbles at address 30000.

    ADD 0507 010000(UN) 020000(UN) 030000(UA)

    Would add the 5 unsigned numeric nibbles at 10000 to
    the 7 unsigned numeric nibbles at 20000 and store them
    as 8-bit EBCDIC bytes at 30000 (inserting the zone digit @F@
    before each numeric nibble). A processor mode toggle selected
    whether the inserted zone digit should be @F@ (EBCDIC) or @3@ (ASCII).

    Likewise for SUB, INC, DEC, MPY, DIV and data movement instructions.

    The data movement instructions would left- or right-align the destination
    field (MVN (move numeric) would right justify and MVA (move alphanumeric) would left justify) when the destination and source field lengths differ.

    Floating point was BCD with an exponent sign digit, two exponent digits,
    a mantissa sign digit and a variable length mantissa of up
    to 100 digits in length. The integer instructions could be used
    on either the mantissa or exponent individually, as they were
    just fields in memory.
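
    A rough C model of the zone handling described above may make it
    concrete. For simplicity it stores one digit per byte (UN as bare digit
    values, UA as bytes with an @F@ zone in the high nibble); the field
    layout and helper names are mine, not the machine's:

        #include <stdio.h>

        static int digit(unsigned char b) { return b & 0x0F; } /* zone ignored */

        /* Add two right-aligned decimal fields of length n into dst,
           producing zoned (UA) or unzoned (UN) output per dst_zoned. */
        static void dec_add(unsigned char *dst, int dst_zoned,
                            const unsigned char *a, const unsigned char *b,
                            int n) {
            int carry = 0;
            for (int i = n - 1; i >= 0; i--) {  /* least significant last */
                int s = digit(a[i]) + digit(b[i]) + carry;
                carry = s >= 10;
                s %= 10;
                dst[i] = dst_zoned ? 0xF0 | s : s;  /* insert zone for UA */
            }
        }

        int main(void) {
            unsigned char a[3] = {1, 2, 5};           /* UN: 125   */
            unsigned char b[3] = {0xF2, 0xF0, 0xF7};  /* UA: "207" */
            unsigned char r[3];
            dec_add(r, 1, a, b, 3);                   /* 125 + 207 = 332 */
            printf("%02X %02X %02X\n", r[0], r[1], r[2]); /* F3 F3 F2 */
            return 0;
        }
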
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 3 19:27:59 2024
    From Newsgroup: comp.arch

    On 4/3/2024 4:53 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...

    So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
    This solves nobody's problems; not even LISP.


    Yeah, there is likely no way to make this worthwhile...




    One issue:
    Decoding based on register tags would mean needing to know the
    register tag bits at the same time the instruction is being decoded.
    In this case, one is likely to need two clock-cycles to fully decode
    the opcode.

    Not good. But what if you don't know the tag until the register is
    delivered from a latent FU, do you stall DECODE, or do you launch and
    make the instruction
    queue element have to deal with all outcomes.


    It is likely that the pipeline would need to stall until results are available.

    It is also likely that such a CPU would have a minimum effective latency
    of 2 or 3 clock cycles for *every* instruction (and probably 4 or 5
    cycles for memory load), in addition to requiring pipeline stalls.


    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied
    up with the comparably expensive register-forwarding logic), but
    asking for 3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues.

    Real-friggen-ely


    These issues could be a deal-breaker for such a CPU.


                                        Or, possible, the FU's decide
    whether to accept the operation:
       ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

    What if IMUL is performed in FMAC, IDIV in FDIV,... Int<->FP routing is
    based on calculation capability {Even CDC 6600 performed int × in the FP
    × unit (not in Thornton's book, but via conversation with 6600 logic designer at Asilomar some time ago. All they had to do to get FP × to perform int × was disable 1 gate.......)


    Then you have a mess...

    So, probably need to sort it out before EX in any case.


    But, a proper dynamic language allows mixing fixnum and flonum with
    the result being implicitly converted to flonum, but from the FPU's
    POV, this would effectively require two chained FADD operations (one
    for the Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....


    If you have dynamic types in hardware in this way, then effectively the typesystem mechanics switch from being a language issue to a hardware issue.


    One may also end up with, say, a CPU that can run Scheme or JavaScript
    or similar, but likely couldn't run C without significant hassles.



    Many other cases could get hairy, but to have any real benefit, the
    CPU would need to be able to deal with them. In cases where the
    compiler deals with everything, the type-tags become mostly moot (or
    potentially detrimental).

    You are arguing that the added complexity would somehow pay for itself.
    I can't see it paying for itself.


    One either goes all in, or abandons the idea entirely.
    There isn't really a middle option in this scenario (then one just ends
    up with something that is bad at everything).

    I was not saying it could work, but in a way, pointing out the issues
    that would likely make this unworkable.


    Though, that said, there could be possible merit in a CPU core that
    could run a language like ECMAScript at roughly C like speeds, even if
    it was basically unusable for pretty much anything else.

    Though, for ECMAScript, one could also make a case for taking the
    SpiderMonkey option and largely abandoning the use of an integer ALU
    (instead running all of the integer math through the FPU; which could be
    modified to support bitwise integer operations and similar as well).


    But, then, there is another issue:
       C code expects C type semantics to be respected, say:
         Signed int overflow wraps at 32 bits (sign extending);
    maybe
         Unsigned int overflow wraps at 32 bits (zero extending);
    maybe

    I am dealing with some code that has a bad habit of breaking if integer overflows don't happen in the expected ways (say, the ROTT engine is
    pretty bad about this one...).

    When I first started working on my ROTT port, there was also a lot of wackiness where the engine would go out of bounds, then behavior would
    depend on what other things in memory it encountered when it did so.


    I have mostly managed to fix up all the out-of-bounds issues, but this
    isn't enough to keep the demos from desyncing (a similar issue applies
    with my Doom port).

    Apparently, other engines like ZDoom and similar needed to do a bit of
    "heavy lifting" to get the demos from all of the various WAD versions to
    play without desync; Doom was also dependent on the behavior of
    out-of-bounds memory accesses, and they needed to turn these into
    in-bounds accesses (to larger memory objects) with the memory contents
    of the out-of-bounds accesses being faked.

    Of course, the other option is just to "fix" the out-of-bounds accesses,
    and live with a port where the demo playback desyncs.



    Meanwhile, Quake entirely avoided this issue:
    The demo playback is based on recording the location and orientation of
    the player and any enemies at every point in time and similar, rather
    than based on recording and replaying the original sequence of keyboard
    inputs (and assuming that everything always happens exactly the same
    each time).


    Then again, these sorts of issues are not unique to these games. Have
    watched more than a few speed-runs involving using glitches either to
    leave the playable parts of the map, or using convoluted sequences of
    actions to corrupt memory in such a way as to achieve a desired effect
    (such as triggering a warp to the end of the game).

    Like, during normal gameplay, these games are seemingly just sorta
    corrupting memory all over the place but, for the most part, no one
    notices until something goes more obviously wrong...


         Variables may not hold values out-of-range for that type;
    LLVM does this GCC does not.
         The 'long long' and 'unsigned long long' types are exactly 64-bit;
    At least 64-bit not exactly.

    C only requires at least 64 bits.
    I suspect that, in practice, most code expects exactly 64 bits.


           ...
         ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.

    Ya think ?


    Both options suck, granted.


    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

    I give up--not worth trying to teach cosmologist why the color of the lipstick going on the pig is not the problem.....


    I was not trying to claim that this idea wouldn't suck.


    In my case, I went a different route that works a little better:
    Leaving all this stuff mostly up to software...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 4 10:32:48 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
                   66000
    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    What do you do when you want a FP bit pattern interpreted as an integer,
    or vice versa.
    This is why, if you want to copy Mill, you have to do it properly:

    Mill does NOT care about the type of data loaded into a particular belt
    slot, only the size and whether it is a scalar or a vector filling up
    the full belt slot. In either case you will also have marker bits for
    special types like None and NaR.

    So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
    last being the same as the scalar anyway).

    Only load ops and explicit widening/narrowing ops set the size tag
    bits; from that point any op where it makes sense will do the right
    thing for either a scalar or a short vector, so you can add 16+16 8-bit
    vars with the same ADD encoding as you would use for a single 64-bit
    ADD.

    We do NOT make any attempt to interpret the actual bit patterns stored
    within each belt slot; that is up to the instructions. This means that
    there is no difference between loading a float or an int32_t, and it
    also means that it is perfectly legal (and supported) to use bit
    operations on an FP variable. This can be very useful, not just to fake
    exact arithmetic by splitting a double into two 26-bit mantissa parts:
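
    For a concrete picture of that split, here is a sketch of the standard
    Veltkamp splitting (which I take to be the technique meant; the code is
    mine). It needs round-to-nearest and a compiler that does not contract
    the expressions into FMAs:

        #include <stdio.h>

        /* Split a into hi + lo, each with at most 26 significant bits,
           so products like hi*hi and hi*lo are exact in double. */
        static void split(double a, double *hi, double *lo) {
            double c = 134217729.0 * a;   /* 2^27 + 1 */
            *hi = c - (c - a);
            *lo = a - *hi;
        }

        int main(void) {
            double hi, lo;
            split(3.141592653589793, &hi, &lo);
            printf("%.17g = %.17g + %.17g\n", hi + lo, hi, lo);
            return 0;
        }
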
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Apr 4 16:47:44 2024
    From Newsgroup: comp.arch

    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 4 21:13:21 2024
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?

    I am much less active than I used to be, but I still get the weekly conf
    call invites and respond to any interesting subject on our mailing list.

    So, yes, I do consider myself to still be involved.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Apr 4 22:25:30 2024
    From Newsgroup: comp.arch

    On Thu, 4 Apr 2024 21:13:21 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?
    I am much less active than I used to be, but I still get the weekly
    conf call invites and respond to any interesting subject on our
    mailing list.

    So, yes, I do consider myself to still be involved.

    Terje


    Thank you

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Thu Apr 4 17:28:43 2024
    From Newsgroup: comp.arch

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing
    a larger register specifier field, or to allow more instructions in
    the smaller subset.

    It is in this spirit that I had an idea, partially inspired by
    Mill’s use of tags in registers, but not memory.  I worked through
    this idea using the My 6600 as an example “substrate” for two
    reasons.  First, it
                    66000
    has several features that are “friendly” to the idea.  Second, I
    know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value,
    set the tag bit for the destination register.  Non-floating-point
    loads would clear the tag bit.  As I show below, I don’t think you >>>> need any special "store tag" instructions.

    What do you do when you want an FP bit pattern interpreted as an
    integer, or vice versa?

    This is why, if you want to copy Mill, you have to do it properly:

    Mill does NOT care about the type of data loaded into a particular belt slot, only the size and if it is a scalar or a vector filling up the
    full belt slot. In either case you will also have marker bits for
    special types like None and NaR.

    So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
    last being the same as the scalar anyway).

    Only load ops and explicit widening/narrowing ops set the size tag
    bits, from that point any op where it makes sense will do the right
    thing for either a scalar or a short vector, so you can add 16+16 8-bit
    vars with the same ADD encoding as you would use for a single 64-bit ADD.

    We do NOT make any attempt to interpret the actual bit patterns stored
    within each belt slot; that is up to the instructions. This means that
    there is no difference between loading a float or an int32_t, and it
    also means that it is perfectly legal (and supported) to use bit
    operations on an FP variable. This can be very useful, not least for
    faking exact arithmetic by splitting a double into two 26-bit mantissa
    parts.


    I guess useful to know.

    Haven't heard much about Mill in a while, so I don't know what, if any,
    progress is being made.


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc.; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check
    Enforce" mode, which would still need a bit of work to implement, and is
    not likely to be terribly useful. The most complicated and expensive
    part is that it would require implicit register and memory tagging (to
    flag capabilities). The cheaper option is simply to not enable it, in
    which case things behave as before, with the new functionality
    essentially being a NOP. Much of the work still needed on this would be
    getting the 128-bit ABI working, and adding some new tweaks to the ABI
    to play well with the capability addressing (effectively, it requires
    partly reworking how global variables are accessed).


    The type-tagging scheme used in my case is very similar to that used in
    my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
    off of an early version of the BGBScript VM, effectively using a lax
    hybrid typesystem masquerading as C). Though, it has long since moved to
    a more proper C-style typesystem, with dynamic types more as an optional
    extension.


    But, as can be noted, since dynamic typing is implemented via runtime
    calls, it is slower than the use of static types. This is likely to be
    unavoidable with any kind of conventional-ish architecture (and some
    structures, like bounded array objects and ex-nihilo objects, are
    difficult to make performance-competitive with bare pointers and
    structs).

    Though, I don't think it is justifiable to forbid their existence
    entirely (as is more the philosophy in many strict static languages), or
    to mandate that programs roll their own (as is typical in C and C++
    land). With compiler and runtime support, it is possible to provide them
    in ways that perform better than a plain C implementation.

    Well, and also the annoyance that seemingly every dynamic-language VM
    takes a different approach to the implementation of its dynamic
    typesystem (along with language differences, ...).

    For example, Common Lisp is very different from Smalltalk, despite both
    being categorically similar in this sense (or, versus Python, or versus JavaScript, or, ...). Not likely viable to address all of them in the
    same runtime (and would likely result in a typesystem that doesn't
    really match with any of them, ...).


    Though, annoyingly, there are not really any mainstream languages in the "hybrid" category (say, in the gray area between C and ActionScript).
    And, then it ends up being a question of which is better in a choice
    between C with AS-like features, or "like AS but with C features".

    So, alas...


    Terje


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 5 01:48:33 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc.; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check Enforce" mode, which would still need a bit of work to implement, and is
    not likely to be terribly useful.

    A while back (and maybe in the future) My 66000 had what I called the
    Foreign Access Mode. When the HoB of the pointer was set, the first
    entry in the translation table was a 4-doubleword structure: a Root
    pointer, the Lowest addressable Byte, the Highest addressable Byte,
    and a DW of access rights, permissions, ... While sort of like a
    capability, I don't think it was close enough to actually be a
    capability or to be used as one.

    So it fell out of favor, and it was not clear how it fit into the
    HyperVisor/SuperVisor model, either.
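
    A rough C sketch of that 4-doubleword entry (field names and ordering
    are assumptions for illustration, not the actual My 66000 layout):

        #include <stdint.h>

        typedef struct {
            uint64_t root;     /* Root pointer of the foreign table */
            uint64_t lowest;   /* Lowest addressable Byte */
            uint64_t highest;  /* Highest addressable Byte */
            uint64_t rights;   /* access rights, permissions, ... */
        } ForeignAccessEntry;
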

    The most complicated and expensive part is that it would require
    implicit register and memory tagging (to flag capabilities). The
    cheaper option is simply to not enable it, in which case things behave
    as before, with the new functionality essentially being a NOP. Much of
    the work still needed on this would be getting the 128-bit ABI working,
    and adding some new tweaks to the ABI to play well with the capability
    addressing (effectively, it requires partly reworking how global
    variables are accessed).


    The type-tagging scheme used in my case is very similar to that used in
    my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
    off of an early version of the BGBScript VM, effectively using a lax
    hybrid typesystem masquerading as C). Though, it has long since moved to
    a more proper C-style typesystem, with dynamic types more as an optional
    extension.

    In general, any time one needs to change the type, you waste an
    instruction compared to typeless registers.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Thu Apr 4 21:13:13 2024
    From Newsgroup: comp.arch

    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of
    computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 5 00:54:54 2024
    From Newsgroup: comp.arch

    On 4/4/2024 8:48 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc.; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check
    Enforce" mode, which would still need a bit of work to implement, and
    is not likely to be terribly useful.

    A while back (and maybe in the future) My 66000 had what I called the
    Foreign Access Mode. When the HoB of the pointer was set, the first
    entry in the translation table was a 4-doubleword structure: a Root
    pointer, the Lowest addressable Byte, the Highest addressable Byte,
    and a DW of access rights, permissions, ... While sort of like a
    capability, I don't think it was close enough to actually be a
    capability or to be used as one.

    So it fell out of favor, and it was not clear how it fit into the
    HyperVisor/SuperVisor model, either.


    Possibly true.

    The idea with BCE mode would be that the pointers would contain an
    address along with an upper and lower bound, and possibly a few access
    flags. It would disable the narrower 64-bit pointer instructions,
    forcing the use of the 128-bit pointer instructions, which would
    perform bounds checks; some instructions would also gain additional
    semantics.
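
    Conceptually, such a pointer might look like the following C sketch (an
    uncompressed form for illustration only; a real 128-bit encoding would
    pack the bounds and flags more tightly):

        #include <stdint.h>

        typedef struct {
            uint64_t addr;   /* the address being dereferenced */
            uint64_t lower;  /* lowest legal byte */
            uint64_t upper;  /* one past the highest legal byte */
        } BcePtr;

        /* The check a bounds-checked load/store of 'size' bytes might
           perform before accessing memory. */
        static int bce_ok(const BcePtr *p, uint64_t size) {
            return p->addr >= p->lower && p->addr + size <= p->upper;
        }
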

    In addition, the Boot SRAM and DRAM gain some special "Tag Bits" areas.


    However, it is unclear if the enforcing mode gains much over the normal optional bounds checking to justify the extra cost. The main "merit"
    case is that, in theory, it could offer some additional protection
    against hostile machine code (whereas the non-enforcing mode is mostly
    useful for detecting out-of-bounds memory accesses).

    However, the optional mode is compatible with the use of 64-bit pointers
    and the existing C ABI, so there is less overhead.


    The most complicated and expensive part is that it would require
    implicit register and memory tagging (to flag capabilities). The
    cheaper option is simply to not enable it, in which case things behave
    as before, with the new functionality essentially being a NOP. Much of
    the work still needed on this would be getting the 128-bit ABI working,
    and adding some new tweaks to the ABI to play well with the capability
    addressing (effectively, it requires partly reworking how global
    variables are accessed).


    The type-tagging scheme used in my case is very similar to that used
    in my previous BGBScript VMs (where, as I can note, BGBCC was itself a
    fork off of an early version of the BGBScript VM, effectively using a
    lax hybrid typesystem masquerading as C). Though, it has long since
    moved to a more proper C-style typesystem, with dynamic types more as
    an optional extension.

    In general, any time one needs to change the type, you waste an
    instruction compared to typeless registers.


    In my case, both types of values are used:
        int x;        //x is a bare register
        void *p;      //may or may not have tag, high 16 bits 0000 if untagged
        __variant y;  //y is tagged
        auto z;       //may be tagged or untagged

    Here, untagged values will generally be used for non-variant types,
    whereas tagged values for variant types.

    Here, 'auto' and 'variant' differ, in that variant says "the type is
    only known at runtime", whereas 'auto' assumes that a type exists and
    may optionally be resolved at compile time (or, alternatively, it may
    decay into variant; assumption being that one may not use auto in ways
    that are incompatible with variant). In terms of behavior, both cases
    may appear superficially similar.

    Though:
        auto z = expr;
    would instead define 'z' as a type inferred from the expression (in a
    similar way to how it works in C++).


    Note that:
        __var x;
    would also give a variable of type variant, but is not exactly the same
    ("__variant" is the type, where "__var" is a statement/expression
    keyword that just so happens to declare a variable of type "__variant"
    when used in this way).



    Say, non-variant:
        int, long, double
        void*, char*, Foo*, ...
        __m128, __vec4f, ...
    Variant:
        __variant, __object, __fixnum, __string, ...

    Where, for example:
        __variant
            May hold (nearly) any type of value at runtime,
            though with some semantic restrictions.
        __object
            Tagged value, like variant;
            but does not allow using operators on it directly.
        __fixnum
            Represents a 62-bit signed integer value.
            Always exists in tagged form.
        __flonum
            Represents a 62-bit floating-point value.
            Effectively a tagged Binary64 shifted right by 2 bits.
        __string
            Holds a string;
            essentially 'char*' but with a type-tagged pointer.
            Defaults to CP-1252 at present, but may also hold a UCS-2 string.
            Strings are assumed to be a read-only character array.
        ...
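
    For illustration, one plausible reading of the __fixnum/__flonum
    encodings above, as a C sketch (the tag placement and tag values are
    assumptions; only the "62-bit" and "Binary64 shifted right by 2 bits"
    parts come from the description):

        #include <stdint.h>
        #include <string.h>

        #define TAG_FIXNUM 1ull  /* assumed 2-bit tag values, kept in */
        #define TAG_FLONUM 2ull  /* the top 2 bits of the word        */

        static uint64_t box_fixnum(int64_t v) {
            return ((uint64_t)v & ((1ull << 62) - 1)) | (TAG_FIXNUM << 62);
        }
        static int64_t unbox_fixnum(uint64_t b) {
            return ((int64_t)(b << 2)) >> 2;  /* sign-extend 62 bits */
        }

        static uint64_t box_flonum(double d) {
            uint64_t bits;
            memcpy(&bits, &d, sizeof bits);
            return (bits >> 2) | (TAG_FLONUM << 62);
        }
        static double unbox_flonum(uint64_t b) {
            uint64_t bits = b << 2;  /* low 2 mantissa bits are lost */
            double d;
            memcpy(&d, &bits, sizeof d);
            return d;
        }
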


    So, say:
        int x, z;
        __variant y;

        y=x;       //implicit int -> __fixnum -> __variant
        z=(int)y;  //coerces y to 'int'

    There are some operators that exist for variant types but not for
    non-variant types, such as __instanceof.

        if(y __instanceof __fixnum)
        {
            //y is known to be a fixnum here
        }

    Where __instanceof can also be used on class instances:
        __class Foo __extends Bar __implements IBaz {
            ... class members ...
        };


    In theory, one could add a header to #define a lot of these keywords in
    non-prefixed forms, in which case one could theoretically write, say:
        public class Foo extends Bar implements IBaz {
            private int x, y;
            public int someMethod()
                { return x+y; }
            public void setX(int val)
                { x=val; }
            ...
        };

    And, if one has, say:
        IBaz baz;
        ...
        if(baz instanceof Foo)
        {
            //baz is an instance of the Foo class
        }

    Though, I will note that object instances are pass-by-reference here
    (like in Java and C#) and not by-value. If one is familiar with Java,
    it is probably not too hard to figure out how some of this works. Also,
    as can be noted, the object model is more like Java-family languages
    than like C++.

    However, unlike Java (and more like ActionScript), one can throw a
    'dynamic' (or '__dynamic') keyword on a class, in which case it is
    possible to create new members in the object instances merely by
    assigning to them (where any members created this way will default to
    being 'public variant').


    Object member access will differ depending on the type of object.
    Direct access to a non-dynamic class member will use a fixed
    displacement (like when accessing a struct). Dynamic members will
    implicitly access an ex-nihilo object that exists as a hidden member in
    the class instance (and using the 'dynamic' modifier on a class will implicitly create this member).

    In this case, interfaces are pulled off by sticking an interface VTable
    pointer onto the end of the object, and then encoding the interface
    reference as a pointer to the pointer to this VTable (with the VTable
    encoding the offset to adjust the object pointer to give a pointer to
    the base class for the virtual method). Note that (unlike in the JVM),
    what interfaces a class implements is fixed at compile time ("interface
    injection" is not possible in BGBCC).
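
    A loose C sketch of that layout (names and the exact vtable contents
    are assumptions; it only illustrates the double indirection and the
    this-adjust offset described above):

        #include <stddef.h>

        typedef struct {
            ptrdiff_t this_adjust;          /* offset from the embedded
                                               vtable pointer back to the
                                               object's base */
            void (*method[1])(void *self);  /* virtual method slots */
        } IfaceVtab;

        /* An interface reference points at the vtable pointer stuck
           onto the end of the object. */
        typedef IfaceVtab **IfaceRef;

        static void *iface_self(IfaceRef ref) {
            return (char *)ref + (*ref)->this_adjust;
        }
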



    There was an experimental C++ mode, which tries to mimic C++ syntax and
    semantics (kinda), sort of awkwardly faking C++'s object system on top
    of the Java-like object system (with POD classes decaying into C
    structs, value objects faked with object cloning, ...). It will not
    take much to see through this illusion, though (and it almost doesn't
    seem worth it).


    If ex-nihilo objects are used, these are treated as separate from the
    instance-of-class objects. In the current implementation, these objects
    are represented as small B-Trees holding key/value associations. Here,
    each key is a 16-bit number (associated with a "symbol") and the value
    is a 64-bit value (variant). Each object has a fixed capacity (16
    members) and, if exceeded, splits apart into a tree (say, a 2-level
    tree representing up to 256 members, with the keys in the top-level
    node encoding the ranges of keys present in each sub-node).

    At present, there is a limit of 64K unique symbols, but this isn't too
    big of an issue in practice (each symbol can be seen as a mapping
    between a 16-bit number and an ASCII string representing the symbol's name).
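
    As a minimal sketch (field names and packing are assumptions), a node
    of such an object might look like:

        #include <stdint.h>

        #define EXN_CAP 16  /* fixed capacity before splitting */

        typedef struct {
            uint8_t  count;         /* live entries in this node */
            uint8_t  is_inner;      /* nonzero: val[] holds child nodes */
            uint16_t key[EXN_CAP];  /* 16-bit symbol numbers, sorted */
            uint64_t val[EXN_CAP];  /* 64-bit variant values (leaf), or
                                       child pointers (inner node) */
        } ExNihiloNode;
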

    A normal class member will be accessed via a direct memory load or
    store; if it is a dynamic member, an implicit runtime call will be
    used instead.



    For dynamic types (variant), pretty much all operations involve runtime
    calls. These calls will perform a dynamically-typed dispatch based on
    the tags of the values they are given.

    Similarly, getting/setting a member in an ex-nihilo object is
    accomplished via a runtime call.

    For performance reasons, high-traffic areas of the dynamic-type runtime
    were written in ASM.


    Though, for performance, the rule here is to avoid using variant except
    in cases where one actually needs dynamic types (and based on whether or
    not compatibility with mainline C compilers is needed).


    On the other side of the language border (BS), the syntax differs
    slightly:
        function foo(x:int, y:int):int
        {
            var z:int;
            z=x+y;
            ...
        }

    And, another language (BS2) of mine had switched to a more Java-like
    syntax (with parts of C syntax bolted on). Though, the practical
    difference between BS2 and the extended C variant is small (if you
    #define the keywords, it is possible to do similar things with mostly
    only minor syntactic differences).


    Similarly, the extended C variant has another advantage:
    It is backwards compatible with C.

    Though, not quite so compatible with C++, and I don't expect C++ fans to
    be all that interested in BGBCC's non-standard dialect.

    OTOH: I will argue that it is at least much less horrid looking than Objective-C.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Fri Apr 5 14:46:35 2024
    From Newsgroup: comp.arch

    On 4/4/2024 10:13 PM, John Savard wrote:
    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.


    This was how the FPU worked in SH-4. Reloading some bits in FPSCR would effectively bank out the current set of FPU instructions (say, between
    Single and Double, etc).


    Also, it was how 64-bit operations worked in early 64-bit versions
    of BJX1.

    Say, there were DQ and JQ bits added to the control register:
        DQ=0: 32-bit for variable-sized operations (like SH-4);
        DQ=1: 64-bit for variable-sized operations.
        JQ=0: 32-bit addressing (SH-4 memory map);
        JQ=1: 48-bit addressing (like the later BJX2 memory map).

    The DQ bit would also affect whether one had MOV.W or MOV.Q operations
    available:
        DQ=0: Interpret ops as MOV.W (16-bit);
        DQ=1: Interpret ops as MOV.Q (64-bit).

    In the DQ=JQ=0 case, it would have been mostly equivalent to SH-4 (and
    could still run GCC's compiler output). This was a similar situation to switching the FPU mode.
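
    As a toy illustration of the mode-dependent decode (bit positions and
    names are invented, not the real BJX1 control-register layout):

        #define DQ_BIT (1u << 0)  /* assumed bit positions */
        #define JQ_BIT (1u << 1)

        typedef enum { OP_MOV_W, OP_MOV_Q } MovKind;

        /* The same encoding decodes differently depending on DQ. */
        static MovKind decode_mov(unsigned ctl) {
            return (ctl & DQ_BIT) ? OP_MOV_Q : OP_MOV_W;
        }

        /* JQ selects between the SH-4 map and 48-bit addressing. */
        static unsigned addr_bits(unsigned ctl) {
            return (ctl & JQ_BIT) ? 48 : 32;
        }
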


    Though, a later version of the BJX1 ISA had dropped and repurposed some encodings, allowing MOV.W and MOV.Q to coexist (and avoiding the need
    for the compiler to endlessly toggle this bit), albeit with fewer
    addressing modes for the latter.

    All this was an issue mostly because SH-4 had used fixed-length 16-bit instructions, and the encoding space was effectively almost entirely
    full when I started (so new instructions required either sacrificing
    existing instructions, or using mode bits).

    Though, BJX1 did end up with some 32-bit ops, some borrowed from SH-2A
    and similar. These were mostly stuck into awkward ad-hoc places in the
    16-bit map, so decoding was kind of a pain.

    ...


    When I later rebooted things as my BJX2 project, I effectively dropped
    this whole mess and started over (with the caveat that it lost SH-4
    compatibility). However, it has since gained RISC-V compatibility, for
    better or worse; RISC-V is at least likely to get slightly better
    performance than SH-4 (and both ISAs can be 64-bit).

    ...


    John Savard

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 5 21:34:16 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    Most programs I write use bytes (mostly unsigned), a few halfwords (mostly
    signed), a useful count of integers (both signed and unsigned--mainly as
    already-defined arguments/returns), and a vast majority of doublewords
    (invariably unsigned).

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 6 21:30:47 2024
    From Newsgroup: comp.arch

    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.
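
    For concreteness, the block arithmetic works out as follows (a sketch;
    the choice of which 4-bit code selects 36-bit mode is an assumption):

        /* A 256-bit block: 4-bit format field plus either nine 28-bit
           or seven 36-bit instructions.
           4 + 9*28 = 256 and 4 + 7*36 = 256, both exact. */
        static unsigned instr_bits(unsigned fmt4) {
            return (fmt4 == 0xF) ? 36 : 28;
        }
        static unsigned instr_count(unsigned fmt4) {
            return (fmt4 == 0xF) ? 7 : 9;
        }
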

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 7 20:41:45 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the >>standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    I agree with the arithmetic going into this statement. What I don't
    have sufficient data concerning is "whether these extra formats pay
    for themselves". For example, how many of the 36-bit encodings are
    irredundant with the 32-bit ones, and so on with the 28-bit ones.

    Take::

    ADD R7,R7,#1

    I suspect there is a 28-bit form, a 32-bit form, and a 36-bit form
    for this semantic step, that you pay for multiple times in decoding
    and possibly pipelining. {{There may also be other encodings for
    this; such as:: INC R7}}

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    Agreed.............

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 7 21:01:15 2024
    From Newsgroup: comp.arch

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 7 21:22:50 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??
    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes. Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy; try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 8 06:21:43 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes.

    There could have been a case for adding this (maybe just for
    a few frequent ones: "add r1,r2,r3", "add r1,r2,-r3", "add
    r1,r2,#num" and "add r1,r2,#-num"), but I did not pursue that
    further.

    I looked at load and store instructions with short offsets
    (these would then have been scaled), and short branches. But
    the 21-bit opcode space filled up really, really rapidly.

    Also, it is easy to synthesize a 3-register operation from
    a 2-register operation and a memory move. If the decoder is
    set up for 42 bits anyway, instruction fusion is also a possibility.
    This got a bit weird.

    Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.

    For somebody who does Fortran, I find the frequency of floating
    point instructions surprisingly low, even in Fortran code.

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy; try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density, one would have to have things
    like memory operands (with the associated danger that compilers
    will not put intermediate results in registers, but since they have
    been optimized for x86 for decades, they are probably better now)
    and load with update, which would then have to be cracked
    into two micro-ops. Not sure about the benefit.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Apr 8 07:16:08 2024
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But beating RISC-V is easy; try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density

    Note that in recent times Mitch Alsup is writing not about code
    density (static code size or dynamically executed bytes), but about
    instruction counts. It's unclear why instruction count would be a
    primary metric, except that he thinks that he can score points for My
    66000 with it. As VAX demonstrates, you can produce an instruction
    set with low instruction counts that is bad at the metrics that really
    count: cycles for executing the program (for a given CPU chip area in
    a given manufacturing process), and, for very small systems, static
    code size.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 8 07:05:35 2024
    From Newsgroup: comp.arch

    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    Yes, that's a problem. Presumably, I would have to do without
    immediates.

    An option would be to reserve some 16-bit codes to indicate a block
    consisting of one 28-bit instruction and seven 32-bit instructions,
    but that means a third instruction set.

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    By using 36-bit instructions instead of 28-bit instructions.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit
    instruction by the addresses of the first seven 32-bit positions in
    the block.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 8 17:25:38 2024
    From Newsgroup: comp.arch

    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 8 19:56:27 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding
    deal with these efficiently ?? That is:: what happens when you jump to
    the middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit
    instruction by the addresses of the first seven 32-bit positions in
    the block.

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256 bits and looks at the first 4 bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Apr 9 18:24:55 2024
    From Newsgroup: comp.arch

    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 9 15:01:50 2024
    From Newsgroup: comp.arch

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
        16/32/48 bit instructions, with 32 GPRs, seem likely optimal for
        code density;
        32/64/96 bit instructions, with 64 GPRs, seem likely optimal for
        performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with 250+
    local variables to make effective use of this, *, which probably isn't
    going to happen).


    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice the number of CPU
    registers. With more local variables than this, the spill/fill rate
    goes up significantly; with fewer, the registers aren't utilized as
    effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
    of variables isn't all that large either.


    Where, function categories:
        Tiny Leaf:
            Everything fits in scratch registers, no stack frame, no calls.
        Leaf:
            No function calls (either explicit or implicit);
            will have a stack frame.
        Non-Leaf:
            May call functions, has a stack frame.


    There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee
    save registers, so (like the "tiny leaf" category, requires that the
    number of local variables is less than the available registers).

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    In the non-static case, the top N variables might be static-assigned,
    and the remaining variables dynamically assigned. Though, it appears
    this is more an artifact of my naive register allocator, and might not
    be as effective a strategy with an "actually clever" register
    allocator (like those in GCC or LLVM), where purely dynamic allocation
    may be better (they are able to carry dynamic assignments across basic
    block boundaries, rather than needing to spill/fill everything whenever
    a branch or label is encountered).

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 9 21:05:44 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with 250+ local variables to make effective use of this, *, which probably isn't
    going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of
    GPRs AND you have good access to constants.

    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice the number of CPU
    registers. With more local variables than this, the spill/fill rate
    goes up significantly; with fewer, the registers aren't utilized as
    effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
    of variables isn't all that large either.

    The vast majority of leaf functions use fewer than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Where, function categories:
    Tiny Leaf:
    Everything fits in scratch registers, no stack frame, no calls.
    Leaf:
    No function calls (either explicit or implicit);
    Will have a stack frame.
    Non-Leaf:
    May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.

    There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee
    save registers, so (like the "tiny leaf" category, requires that the
    number of local variables is less than the available registers).

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    The apparent number of registers goes up when one does not waste a register
    to hold a use-once constant.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Apr 9 17:47:13 2024
    From Newsgroup: comp.arch

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
    density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISAs I had tried to generate code for, 16 GPRs was kind of
    a pain, as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for
    variables in basic blocks which ended in a function call (in my compiler
    design, both function calls and branches terminate the current
    basic block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice the number of CPU
    registers. With more local variables than this, the spill/fill rate
    goes up significantly; with fewer, the registers aren't utilized
    as effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of scratch
    registers. However, for many/most small leaf functions, the total
    number of variables isn't all that large either.

    The vast majority of leaf functions use fewer than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc. are function calls in cases when
    not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium-size memcpy and memset style operations by branching into a
    slide.
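
    A hedged C sketch of the "slide" idea, using switch fallthrough in
    place of a computed branch into unrolled copy code (the sizes and the
    8-word cap are arbitrary, for illustration only):

        #include <stdint.h>
        #include <stddef.h>

        /* Copy n doublewords (n <= 8) by entering the slide at the
           right depth; each case falls through to the ones below it. */
        static void memcpy_slide8(uint64_t *d, const uint64_t *s, size_t n) {
            switch (n) {
            case 8: d[7] = s[7]; /* fall through */
            case 7: d[6] = s[6]; /* fall through */
            case 6: d[5] = s[5]; /* fall through */
            case 5: d[4] = s[4]; /* fall through */
            case 4: d[3] = s[3]; /* fall through */
            case 3: d[2] = s[2]; /* fall through */
            case 2: d[1] = s[1]; /* fall through */
            case 1: d[0] = s[0]; /* fall through */
            case 0: break;
            }
        }
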



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in
    scratch registers. On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with a large number of local variables (some
    exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
        There is no frame pointer, as BGBCC doesn't use one;
        All stack-frames are fixed size; VLAs and alloca use the heap;
        GOT: N/A in my ABI (stuff is GBR-relative, but GBR is not a GPR);
        TLS: accessed via TBR.

    Try/throw/catch:
        Mostly N/A for leaf functions.

    Any function that can "throw" is, in effect, no longer a leaf function.
    Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    The need for GBR save/restore effectively excludes a function from
    being tiny-leaf. This may happen, say, if a function accesses global
    variables and may be called as a function pointer.


    There is a "static assign everything" case in my case, where all of
    the variables are statically assigned to registers (for the scope of
    the function). This case typically requires that everything fit into
    callee save registers, so (like the "tiny leaf" category, requires
    that the number of local variables is less than the available registers).

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    The apparent number of registers goes up when one does not waste a register to hold a use-once constant.

    Possibly true. In the "static assign everything" case, each constant
    used is also assigned a register.


    One "TODO" here would be to merge constants with the same "actual" value
    into the same register. At present, they will be duplicated if the types
    are sufficiently different (such as integer 0 vs NULL).
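
    As a rough illustration of the merge (a sketch only; the names are
    hypothetical and not BGBCC's actual internals), the constant pool could
    be keyed on the raw 64-bit pattern rather than the IR type, so integer
    0 and NULL land in the same slot:

       #include <stdint.h>

       #define CPOOL_MAX 64

       typedef struct {
          uint64_t value;   /* raw bit pattern of the constant */
          int      reg;     /* register statically assigned to it */
       } CPoolEnt;

       static CPoolEnt cpool[CPOOL_MAX];
       static int cpool_n;

       /* Return the register holding 'value', assigning one if needed.
        * Keying on the bit pattern merges integer 0, NULL, 0.0, etc.
        * (bounds checking and spill handling elided). */
       int cpool_reg_for(uint64_t value, int (*alloc_reg)(void))
       {
          int i;
          for (i = 0; i < cpool_n; i++)
             if (cpool[i].value == value)
                return cpool[i].reg;
          cpool[cpool_n].value = value;
          cpool[cpool_n].reg = alloc_reg();
          return cpool[cpool_n++].reg;
       }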

    For functions with dynamic assignment, immediate values are more likely
    to be used. If the code-generator were clever, potentially it could
    exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate. Currently,
    BGBCC is not that clever.

    Or, say:
    y=x+31; //31 only being used here, and fits easily in an Imm9.
    Ideally, compiler could realize 31 does not need a register here.
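
    The check itself would be cheap; a sketch of the sort of test a code
    generator might apply (hypothetical interfaces, and assuming a signed
    9-bit immediate field; the real Imm9 encoding may differ):

       #include <stdint.h>

       static int fits_imm9(int64_t v)
       {
          return v >= -256 && v <= 255;   /* signed 9-bit range assumed */
       }

       /* 'use_count' and 'has_imm_form' would come from the IR; both
        * are stand-ins here, not BGBCC's real interfaces. */
       static int const_needs_register(int64_t v, int use_count,
                                       int has_imm_form)
       {
          if (use_count == 1 && has_imm_form && fits_imm9(v))
             return 0;   /* e.g. the 31 in y=x+31 encodes directly */
          return 1;
       }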


    Well, and another weakness is with temporaries that exist as function arguments:
    If static assigned, the "target variable directly to argument register" optimization can't be used (it ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the
    compiler breaks...).

    Though, I guess it could be possible that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid
    statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 00:28:02 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
    density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind of
    a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for
    variables in basic blocks which ended in a function call (in my compiler
    design, both function calls and branches terminate the current
    basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.

    Yes, x86 and any architecture with LD-Ops (IBM 360, S.E.L., Interdata,
    ...) act as if they have 4-6 more registers than they really have. x86
    with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. It does
    not really take the place of universal constants, but goes a long way.


    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases when
    not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.

    Did end up with an intermediate "memcpy slide", which can handle medium
    size memcpy and memset style operations by branching into a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This
    means one can start (queue up) a SATA disk access without obtaining a lock
    to the device--simply because one can fill in all the data of a command in
    a single instruction which smells ATOMIC to all interested 3rd parties.

    As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.

    On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    The data back in the R2000-3000 days indicated that 32 GPRs had a 15%+
    advantage over 16 GPRs; while 64 had only a 3% advantage.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with a large number of local variables (some
    exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
    There is no frame pointer, as BGBCC doesn't use one;

    Can't do PASCAL and other ALGOL-derived languages with block structure.

    All stack-frames are fixed size, VLA's and alloca use the heap;

    longjmp() is at a serious disadvantage here.
    Destructors are sometimes hard to position on the stack.

    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
    TLS, accessed via TBR.

    Try/throw/catch:
    Mostly N/A for leaf functions.

    Any function that can "throw" is, in effect, no longer a leaf function.
    Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?

    Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables
    and may be called as a function pointer.

    ------------------------------------------------------

    One "TODO" here would be to merge constants with the same "actual" value into the same register. At present, they will be duplicated if the types
    are sufficiently different (such as integer 0 vs NULL).

    In practice, the upper 48 bits of an extern variable are completely
    shared, whereas the lower 16 bits are unique.

    For functions with dynamic assignment, immediate values are more likely
    to be used. If the code-generator were clever, potentially it could
    exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate. Currently, BGBCC is not that clever.

    And then there are languages like PL/1 and FORTRAN where the compiler
    has to figure out how big an intermediate array is, allocate it, perform
    the math, and then deallocate it.

    Or, say:
    y=x+31; //31 only being used here, and fits easily in an Imm9.
    Ideally, compiler could realize 31 does not need a register here.


    Well, and another weakness is with temporaries that exist as function arguments:
    If static assigned, the "target variable directly to argument register" optimization can't be used (it ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).

    Though, I guess it could be possible that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Brian's compiler finds the largest argument list and the largest return
    value list and merges them into a single area on the stack used only
    for passing arguments and results across the call interface. And the
    <static> SP points at this area.

    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...


    More orthogonality helps.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Apr 9 22:01:00 2024
    From Newsgroup: comp.arch

    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind of
    a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for
    variables in basic blocks which ended in a function call (in my compiler
    design, both function calls and branches terminate the current
    basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice that of the number of
    CPU registers. If more local variables than this, then spill/fill
    rate goes up significantly, and if less, then the registers aren't
    utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is instead
    that the number of local variables be less than the number of scratch
    registers. However, for many/most small leaf functions, the total
    number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases when
    not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle medium
    size memcpy and memset style operations by branching into a slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers. On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with a large number of local variables (some
    exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
      There is no frame pointer, as BGBCC doesn't use one;
        All stack-frames are fixed size, VLA's and alloca use the heap;
      GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
      TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 02:24:40 2024
    From Newsgroup: comp.arch

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminating the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via
    PC-relative memory loads, which kinda sucked.


    Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
    .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...



    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with
    code that has lots of spill-and-fill. This along with instructions
    having access to 32-bit immediate values.

    Yes, x86 and any architecture with LD-Ops (IBM 360, S.E.L., Interdata,
    ...) act as if they have 4-6 more registers than they really have. x86
    with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. It does
    not really take the place of universal constants, but goes a long way.


    Yeah.



    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...


    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high overhead
    for small to medium copies.


    So, there is a size range where doing it inline would be too bulky, but
    a loop carries an undesirable level of overhead.

    Ended up doing these with "slides", which end up eating roughly several
    kB of code space, but was more compact than using larger inline copies.


    Say (IIRC):
    128 bytes or less: Inline Ld/St sequence
    129 bytes to 512B: Slide
    Over 512B: Call "memcpy()" or similar.

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Though, this is only used for fixed-size copies (or "memcpy()" when
    value is constant).


    Say:

    __memcpy64_512_ua:
       MOV.Q  (R5, 480), R20
       MOV.Q  (R5, 488), R21
       MOV.Q  (R5, 496), R22
       MOV.Q  (R5, 504), R23
       MOV.Q  R20, (R4, 480)
       MOV.Q  R21, (R4, 488)
       MOV.Q  R22, (R4, 496)
       MOV.Q  R23, (R4, 504)

    __memcpy64_480_ua:
       MOV.Q  (R5, 448), R20
       MOV.Q  (R5, 456), R21
       MOV.Q  (R5, 464), R22
       MOV.Q  (R5, 472), R23
       MOV.Q  R20, (R4, 448)
       MOV.Q  R21, (R4, 456)
       MOV.Q  R22, (R4, 464)
       MOV.Q  R23, (R4, 472)

    ...

    __memcpy64_32_ua:
       MOV.Q  (R5), R20
       MOV.Q  (R5, 8), R21
       MOV.Q  (R5, 16), R22
       MOV.Q  (R5, 24), R23
       MOV.Q  R20, (R4)
       MOV.Q  R21, (R4, 8)
       MOV.Q  R22, (R4, 16)
       MOV.Q  R23, (R4, 24)
       RTS
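
    To sketch how such a slide gets entered (a C model only; the label
    name, the assumption of 32 code bytes per entry, and the function-
    pointer cast are all illustrative, since the real dispatch is emitted
    inline by the compiler):

       #include <stddef.h>
       #include <stdint.h>
       #include <string.h>

       extern uint8_t __memcpy64_slide[];   /* __memcpy64_512_ua label */

       typedef void (*slide_fn)(void *dst, const void *src);

       /* len assumed in the 129..512 byte range covered by the slide.
        * Each entry is assumed to be 32 bytes of code handling 32 data
        * bytes, so the entry copying main_len bytes sits at code offset
        * (512 - main_len) from the 512-byte entry point. */
       void memcpy_slide(void *dst, const void *src, size_t len)
       {
          size_t main_len = len & ~(size_t)31;  /* multiple of 32 */
          size_t tail     = len - main_len;     /* 0..31 leftover bytes */

          /* Tail bytes are handled before branching into the slide. */
          memcpy((uint8_t *)dst + main_len,
                 (const uint8_t *)src + main_len, tail);

          if (main_len)
             ((slide_fn)(uintptr_t)(__memcpy64_slide + (512 - main_len)))
                (dst, src);
       }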




    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This means one can start (queue up) a SATA disk access without obtaining a lock
    to the device--simply because one can fill in all the data of a command in
    a single instruction which smells ATOMIC to all interested 3rd parties.


    My case, non-atomic, polling IO.


    Code fragment:
    while(ct<cte)
    {
       P_SPI_QDATA=0xFFFFFFFFFFFFFFFFULL;           // dummy bits to clock out
       P_SPI_CTRL=tkspi_ctl_status|SPICTRL_XMIT8X;  // start an 8-byte transfer
       v=P_SPI_CTRL;
       while(v&SPICTRL_BUSY)                        // spin until done
          v=P_SPI_CTRL;
       *(u64 *)ct=P_SPI_QDATA;                      // fetch the 8 bytes read
       ct+=8;
    }

    Where the MMIO interface allows sending/receiving 8 bytes at a time to
    avoid bogging down at around 500 K/s or so (with 8B transfers, could
    theoretically do 4 MB/s; though it is only ~ 1.5 MB/s with 12.5 MHz SPI,
    as 12.5 Mbit/s is only about 1.5 MB/s of raw bit rate).

    Though, this is part of why I had ended up LZ compressing damn near
    everything (LZ4 or RP2 being faster than sending ~ 3x as much data over
    the SPI interface).


    Hadn't generally used Huffman as the additional compression wasn't worth
    the fairly steep performance cost (with something like Deflate, it would barely be much faster than the bare SPI interface).



    Did recently come up with a "pseudo entropic" coding that seems
    promising in some testing:
       Rank symbols by probability, sending the most common 128 symbols;
       Send the encoded symbols as table indices via bytes, say:
         00..78: Pair of symbol indices, 00..0A;
         7F: Escaped Byte
         80..FF: Symbol Index

    Which, while it seems like this would likely fail to do much of
    anything, it "sorta works", and is much faster to unpack than Huffman.

    Though, if the distribution is "too flat", one needs to be able to fall
    back to raw bytes.
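
    As one concrete reading of the scheme (a minimal decoder sketch; the
    split of the 00..78 range into index pairs 0..0A is my interpretation,
    since 11*11 = 121 codes, and transmission of the ranked table is
    elided):

       #include <stddef.h>
       #include <stdint.h>

       /* table[] holds the 128 most common symbols, most common first. */
       size_t pse_decode(const uint8_t *src, size_t n,
                         uint8_t *dst, const uint8_t table[128])
       {
          uint8_t *d = dst;
          size_t i = 0;
          while (i < n) {
             uint8_t b = src[i++];
             if (b >= 0x80) {          /* 80..FF: one table index */
                *d++ = table[b - 0x80];
             } else if (b == 0x7F) {   /* 7F: escaped raw byte */
                *d++ = src[i++];
             } else {                  /* 00..78: pair of indices, 0..0A */
                *d++ = table[b / 11];
                *d++ = table[b % 11];
             }
          }
          return (size_t)(d - dst);
       }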

    Had experimentally written a compressor based around this scheme, and
    while not as fast as LZ4, it did give compression much closer to Deflate.


    Where, IME, on my current main PC:
       LZMA: ~ 35 MB/s
         Bitwise range coder.
       Deflate: ~ 200 MB/s
         Huffman based, symbols limited to 15 bits.
       TKuLZ: ~ 350 MB/s
         Resembles a Deflate / LZ4 hybrid.
         Huffman based, symbols limited to 12 bits.
       TKFLZH: ~ 500 MB/s
         Similar to a more elaborate version of TKuLZ.
         Huffman symbols limited to 13 bits.
       TKDELZ: ~ 700 MB/s
         Similar to the prior, but:
           Splits symbols into separately-coded blocks;
           Uses an interleaved encoding scheme, decoding 4 symbols at a time.
       PSELZ: ~ 1.0 GB/s
         Uses separate symbol blocks, with the "pseudo entropic" encoding.
       RP2: ~ 1.8 GB/s
         Byte oriented.
       LZ4: ~ 2.1 GB/s



    Though, RP2 and LZ4 switch places on BJX2, where RP2 is both slightly
    faster and gives slightly better compression.

    I suspect this is likely because of differences in the relative cost of
    byte loads and branch mispredicts.


    Note that TKuLZ/TKFLZH/TKDELZ/PSELZ used a similar structure for
    encoding LZ matches:
       TAG
       (Raw Length)
       (Match Length)
       Match Distance
       (Literal Bytes)
    Where, TAG has a structure like:
       (7:5): Raw Length (0..6, 7 = Separate length)
       (4:0): Match Length (3..33, 34 = Separate length)

    Though, the former 3 were using a combination nybble-stream and bitstream.

    Had considered a nybble stream for PSELZ, but ended up using bytes as
    bytes are faster.
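
    Read back as code, the match header might unpack roughly like so (the
    byte-oriented PSELZ case assumed; read_len() and read_dist() are
    stand-ins for however the separate lengths and the distance are
    actually coded):

       #include <stdint.h>

       typedef struct {
          uint32_t raw_len;    /* literal bytes carried with the match */
          uint32_t match_len;
          uint32_t dist;
       } LzTok;

       extern uint32_t read_len(const uint8_t **p);   /* assumed helper */
       extern uint32_t read_dist(const uint8_t **p);  /* assumed helper */

       LzTok lz_read_tok(const uint8_t **p)
       {
          LzTok t;
          uint8_t  tag = *(*p)++;
          uint32_t r = (tag >> 5) & 7;      /* (7:5): raw length field */
          uint32_t m = (tag & 31) + 3;      /* (4:0): match length, 3..34 */
          t.raw_len   = (r == 7)  ? read_len(p) : r;   /* 7 = separate */
          t.match_len = (m == 34) ? read_len(p) : m;   /* 34 = separate */
          t.dist      = read_dist(p);
          /* t.raw_len literal bytes follow the header */
          return t;
       }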



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in
    scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.


    OK.


                        On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    The data back in the R2000-3000 days indicated that 32 GPRs had a 15%+
    advantage over 16 GPRs; while 64 had only a 3% advantage.


    Probably true enough.


    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local variables
    (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;

    Can't do PASCAL and other ALGOL-derived languages with block structure.


    Nothing prevents having a frame pointer; BGBCC just doesn't use one, as
    it doesn't really gain anything if one has fixed-size stack frames (and
    it is another register to be saved/restored).

    Granted, it would make stack-walking easier.
    As is, one needs to use a similar strategy to how one does stack
    unwinding in the Win64 ABI (namely looking stuff up in a table, and then parsing the instruction sequence).


         All stack-frames are fixed size, VLA's and alloca use the heap;

    longjmp() is at a serious disadvantage here. Destructors are sometimes
    hard to position on the stack.


    Yeah... If you use longjmp, any VLA's or alloca's are gonna be leaked...

    Nothing really mandates that longjmp + alloca not result in a memory
    leak though (or that alloca can't be implemented as a fancy wrapper over malloc).


       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.

    Try/throw/catch:
       Mostly N/A for leaf functions.

    Any function that can "throw" is, in effect, no longer a leaf function.
    Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?


    ?...

    Throw was implemented as a runtime call in my case.

    Though, try/catch involves arcane metadata and compiler magic (basically borrowing a similar design to the WinCE and Win64 exception handling).

    Though, could have in theory gone the SEH route and implemented it using hidden runtime calls and a linked list. There are pros/cons either way
    (and SEH would have an on-average lower overhead for C, since when it is
    not used, its cost would effectively drop to 0; unlike the cost of
    PE/COFF exception-handling tables, and needing to provide dummy
    exception catch/continue blobs in the off-chance exceptions were used
    and passed through C land).


    Need for GBR save/restore effectively excludes a function from being
    tiny-leaf. This may happen, say, if a function accesses global
    variables and may be called as a function pointer.

    ------------------------------------------------------

    One "TODO" here would be to merge constants with the same "actual"
    value into the same register. At present, they will be duplicated if
    the types are sufficiently different (such as integer 0 vs NULL).

    In practice, the upper 48 bits of an extern variable are completely
    shared, whereas the lower 16 bits are unique.

    For functions with dynamic assignment, immediate values are more
    likely to be used. If the code-generator were clever, potentially it
    could exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate.
    Currently, BGBCC is not that clever.

    And then there are languages like PL/1 and FORTRAN where the compiler
    has to figure out how big an intermediate array is, allocate it, perform
    the math, and then deallocate it.


    I don't expect BJX2 and FORTRAN to cross paths...


    Or, say:
       y=x+31;  //31 only being used here, and fits easily in an Imm9.
    Ideally, compiler could realize 31 does not need a register here.


    Well, and another weakness is with temporaries that exist as function
    arguments:
    If static assigned, the "target variable directly to argument
    register" optimization can't be used (it ends up needing to go into a
    callee-save register and then be MOV'ed into the argument register;
    otherwise the compiler breaks...).

    Though, I guess it could be possible that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid
    statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Brian's compiler finds the largest argument list and the largest return
    value list and merges them into a single area on the stack used only
    for passing arguments and results across the call interface. And the
    <static> SP points at this area.


    The issue isn't with stack space, this part is straightforward.

    Rather, that in the IR stage, one has something like (pseudocode):
       t4 := t0 + 1;
       t5 := t1 + 2;
       t6 := func(t4, t5);
       t7 := t6 + 3;

    Where, at the ASM level, one could do, say:
       ADD R8, 1, R4
       ADD R9, 2, R5
       BSR func
       ADD R2, 3, R10

    But... Pulling this off without the compiler and/or compiled program
    exploding in the process, is easier said than done.

    OTOH:
       ADD R8, 1, R11
       ADD R9, 2, R12
       MOV R11, R4
       MOV R12, R5
       BSR func
       MOV R2, R11
       ADD R11, 3, R10

    This is not quite as efficient but, despite some efforts, is closer to
    the current situation than the version above. There are some cases where
    these optimizations can be performed, but only if the variables in
    question are not static-assigned, which forces the latter scenario.


    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...


    More orthogonality helps.

    These parts of my compiler are a horrible mess, and rather brittle...

    Partial reason there is no RISC-V support in BGBCC, is because the ABI
    is different, and the current design of the register allocator can't
    deal with a different ABI design.




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 02:41:01 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot
    of CPU time in functions that have large numbers of local variables
    all being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminate the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via
    PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with
    code that has lots of spill-and-fill. This along with instructions
    having access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions)
    if the number of local variables is roughly twice that of the number
    of CPU registers. If more local variables than this, then spill/fill
    rate goes up significantly, and if less, then the registers aren't
    utilized as effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of
    scratch registers. However, for many/most small leaf functions, the
    total number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in
    scratch registers. On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local variables
    (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;
         All stack-frames are fixed size, VLA's and alloca use the heap;
       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...


    Well, in this case:
    The alloca calls are turned into calls which allocate the memory blob
    and add it to a linked list;
    when the function returns, everything in the linked list is freed;
    Then, it internally pulls this off via malloc and free.
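
    A minimal sketch of the runtime side of this (names hypothetical; the
    real calls are whatever BGBCC emits, and alignment details are glossed
    over):

       #include <stdlib.h>

       typedef struct AllocaBlk { struct AllocaBlk *next; } AllocaBlk;

       /* Each alloca becomes a call like this, with 'list' a hidden
        * per-function-invocation variable. */
       void *__alloca_push(AllocaBlk **list, size_t sz)
       {
          AllocaBlk *blk = malloc(sizeof(AllocaBlk) + sz);
          blk->next = *list;
          *list = blk;
          return blk + 1;   /* payload starts after the link header */
       }

       /* Emitted on every return path of a function that used alloca. */
       void __alloca_pop_all(AllocaBlk **list)
       {
          while (*list) {
             AllocaBlk *next = (*list)->next;
             free(*list);
             *list = next;
          }
       }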

    Also the typical default stack size in this case is 128K, so trying to
    put big allocations on the stack is more liable to result in a stack
    overflow.

    Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
    heap allocation is not too slow in this case.


    Though, at the same time, ideally one limits use of language features
    where the code-generation degenerates into a mess of hidden runtime
    calls. These cases are not ideal for performance...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 17:12:47 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??

    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
    .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........

    For small copies, can encode them inline, but past a certain size this becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high overhead
    for small to medium copies.


    So, there is a size range where doing it inline would be too bulky, but
    a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an instruction.

    Ended up doing these with "slides", which end up eating roughly several
    kB of code space, but was more compact than using larger inline copies.


    Say (IIRC):
    128 bytes or less: Inline Ld/St sequence
    129 bytes to 512B: Slide
    Over 512B: Call "memcpy()" or similar.

    Versus::
    1-infinity: use MM instruction.

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??

    Though, this is only used for fixed-size copies (or "memcpy()" when
    value is constant).


    Say:

    __memcpy64_512_ua:
    MOV.Q (R5, 480), R20
    MOV.Q (R5, 488), R21
    MOV.Q (R5, 496), R22
    MOV.Q (R5, 504), R23
    MOV.Q R20, (R4, 480)
    MOV.Q R21, (R4, 488)
    MOV.Q R22, (R4, 496)
    MOV.Q R23, (R4, 504)

    __memcpy64_480_ua:
    MOV.Q (R5, 448), R20
    MOV.Q (R5, 456), R21
    MOV.Q (R5, 464), R22
    MOV.Q (R5, 472), R23
    MOV.Q R20, (R4, 448)
    MOV.Q R21, (R4, 456)
    MOV.Q R22, (R4, 464)
    MOV.Q R23, (R4, 472)

    ....

    __memcpy64_32_ua:
    MOV.Q (R5), R20
    MOV.Q (R5, 8), R21
    MOV.Q (R5, 16), R22
    MOV.Q (R5, 24), R23
    MOV.Q R20, (R4)
    MOV.Q R21, (R4, 8)
    MOV.Q R22, (R4, 16)
    MOV.Q R23, (R4, 24)
    RTS

    Duff's device in any other name.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 10 17:29:22 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Apr 10 11:57:14 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:41 AM, BGB wrote:
    On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot
    of CPU time in functions that have large numbers of local variables
    all being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself
    (such as dealing with register allocation involving scratch registers
    while also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminate the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via
    PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory
    operands with most instructions, and the CPU tends to deal fairly
    well with code that has lots of spill-and-fill. This along with
    instructions having access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions)
    if the number of local variables is roughly twice that of the
    number of CPU registers. If more local variables than this, then
    spill/fill rate goes up significantly, and if less, then the
    registers aren't utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is
    instead that the number of local variables be less than the number
    of scratch registers. However, for many/most small leaf functions,
    the total number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one
    has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a
    slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely
    in scratch registers. On a 64 GPR machine, this percentage is
    slightly higher (but, not significantly, since there are few leaf
    functions remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.
    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local
    variables (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;
         All stack-frames are fixed size, VLA's and alloca use the heap;
       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...


    Well, in this case:
    The alloca calls are turned into calls which allocate the memory blob
    and add it to a linked list;
    when the function returns, everything in the linked list is freed;
    Then, it internally pulls this off via malloc and free.

    Also the typical default stack size in this case is 128K, so trying to
    put big allocations on the stack is more liable to result in a stack overflow.

    Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
    heap allocation is not too slow in this case.


    Though, at the same time, ideally one limits use of language features
    where the code-generation degenerates into a mess of hidden runtime
    calls. These cases are not ideal for performance...



    Sometimes alloca is useful wrt offsetting the stack to avoid false
    sharing between stacks. Intel wrote a little paper that addresses this:

    https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

    Remember that one?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 10 15:51:07 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).


    Did have a mess when I later extended the ISA to 32 GPRs, as (like with
    BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    All of this stuff was apparently sufficient for the SEGA
    32X/Saturn/Dreamcast consoles... (vs the Genesis/MegaDrive using a
    M68000, and Master System using a Z80).


    I guess for a while it was also popular in CD-ROM and HDD controllers.
    I guess after SEGA left the game-console market, they had continued
    using it for a while in arcade machines, before apparently later jumping
    over to x86 via low-end PC motherboards (I guess it being cheaper since
    the mid/late 2000s to build an arcade machine with off-the-shelf PC parts).

    Saw a video where a guy was messing with one of these, where I guess
    despite being built with low-end PC parts (and an entry-level graphics
    card), the parts were balanced well enough that it still gave fairly
    decent gaming performance.



    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the merit
    of being able to branch anywhere within a Doom or Quake sized binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Granted, it is possible had I instead started with RISC-V instead of
    SuperH, it is probable BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than RV32I
    or similar. Turns out not really, as the penalty of the 16 bit ops was
    needing almost twice as many on average.


    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    I had generally avoided anything that would have required microcode or
    shoving state machines into the pipeline or similar.

    Things like Load/Store-Multiple or similar.


    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copies.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an instruction.


    This is an area where "slides" work well, the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...


    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
    iteration or so to try to limit looping overhead.

    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.
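
    For the looping case, a plain C sketch of a 64-bytes-per-iteration
    inner loop (the function name is made up; alignment and tail handling
    are elided):

       #include <stddef.h>
       #include <stdint.h>

       /* Copy len bytes, 64 per iteration, to amortize loop overhead.
        * Assumes 8-byte alignment and len a multiple of 64. */
       void copy_loop64(uint64_t *dst, const uint64_t *src, size_t len)
       {
          size_t n = len >> 3;              /* count in 8-byte words */
          while (n >= 8) {                  /* 8 words = 64 bytes */
             dst[0] = src[0];  dst[1] = src[1];
             dst[2] = src[2];  dst[3] = src[3];
             dst[4] = src[4];  dst[5] = src[5];
             dst[6] = src[6];  dst[7] = src[7];
             dst += 8;  src += 8;  n -= 8;
          }
       }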



    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
    shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added a
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for each
    LZ decoder (as I see it, it is a similar issue to not wanting code to endlessly re-roll stuff for functions like memcpy or malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
        1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.


    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    Main wonk is that it does start copying from the high address first.
    Presumably interrupts or similar won't be messing with application memory
    mid-memcpy.

    The looping memcpy's generally work from low to high addresses though.


    Though, this is only used for fixed-size copies (or "memcpy()" when
    the size is constant).


    Say:

    __memcpy64_512_ua:
       MOV.Q        (R5, 480), R20
       MOV.Q        (R5, 488), R21
       MOV.Q        (R5, 496), R22
       MOV.Q        (R5, 504), R23
       MOV.Q        R20, (R4, 480)
       MOV.Q        R21, (R4, 488)
       MOV.Q        R22, (R4, 496)
       MOV.Q        R23, (R4, 504)

    __memcpy64_480_ua:
       MOV.Q        (R5, 448), R20
       MOV.Q        (R5, 456), R21
       MOV.Q        (R5, 464), R22
       MOV.Q        (R5, 472), R23
       MOV.Q        R20, (R4, 448)
       MOV.Q        R21, (R4, 456)
       MOV.Q        R22, (R4, 464)
       MOV.Q        R23, (R4, 472)

    ....

    __memcpy64_32_ua:
       MOV.Q        (R5), R20
       MOV.Q        (R5, 8), R21
       MOV.Q        (R5, 16), R22
       MOV.Q        (R5, 24), R23
       MOV.Q        R20, (R4)
       MOV.Q        R21, (R4, 8)
       MOV.Q        R22, (R4, 16)
       MOV.Q        R23, (R4, 24)
       RTS

    Duff's device by any other name.

    More or less, though I think the idea of Duff's device is specifically
    in the way that it abuses the while-loop and switch constructs.

    This is basically just an unrolled slide.
    So, where one branches into it determines how much is copied.

    For small-to-medium copies, the advantage is mostly that this avoids
    looping overhead.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 21:19:20 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).

    My suggestion is that:: "Now that you have screwed around for a while,
    Why not take that experience and do a new ISA without any of those
    mistakes in it" ??

    Did have a mess when I later extended the ISA to 32 GPRs, as (like with
    BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    But you did not fix them en masse, and you complain about them
    at least once a week. There comes a time when it takes less time
    and less courage to do that big switch and clean up all that mess.


    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the merit
    of being able to branch anywhere within a Doom or Quake sized binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Maybe consider now as the appropriate time to start.

    Granted, had I instead started with RISC-V rather than SuperH, it is
    probable BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than RV32I
    or similar. Turns out not really, as the penalty of the 16 bit ops was needing almost twice as many on average.

    My 66000 only requires 70% of the instruction count of RISC-V,
    Yours could too ................

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    You would not have to spend hours a week defending the indefensible !!

    I had generally avoided anything that would have required microcode or
    shoving state-machines into the pipeline or similar.

    Things as simple as IDIV and FDIV require sequencers.
    But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!

    Things like Load/Store-Multiple or

    If you like polluted ICaches..............

    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copy.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an
    instruction.


    This is an area where "slides" work well; the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Consider that the predictor getting into the slide the first time
    always mispredicts !!

    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

    What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
    yet a HW sequencer only has to avoid asserting a single byte write enable
    once.

    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop iteration or so to try to limit looping overhead.

    On low end machines, you want to operate at cache port width,
    On high end machines, you want to operate at cache line widths per port.
    This is essentially impossible using slides.....here, the same code is
    not optimal across a line of implementations.

    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.

    What do you do when the SATA drive wants to write a whole page ??

    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for each
    LZ decoder (as I see it, it is a similar issue to not wanting code to
    endlessly re-roll stuff for functions like memcpy or malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
        1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.

    By what, 37-gates ??

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    What if a SATA drive is reading while you are writing !!
    That is, DMA is no different than multi-threaded applications--except
    DMA cannot perform locks.

    Main wonk is that it does start copying from the high address first.
    Presumably interrupts or similar won't be messing with application
    memory mid-memcpy.

    The only things wanting high-low access patterns are dumping stuff to the stack. The fact you CAN get away with it most of the time is no excuse.

    The looping memcpy's generally work from low to high addresses though.

    As does all string processing.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 10 16:53:32 2024
    From Newsgroup: comp.arch

    On 4/10/2024 1:57 PM, Chris M. Thomasson wrote:
    On 4/10/2024 12:41 AM, BGB wrote:
    On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot >>>>>> of CPU time in functions that have large numbers of local
    variables all being used at the same time.


    Seemingly:
      16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
      code density;
      32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
      performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to
    use all of the registers at the same time without stepping on itself
    (such as dealing with register allocation involving scratch
    registers while also not conflicting with the use of function
    arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminate the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was
    via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory
    operands with most instructions, and the CPU tends to deal fairly
    well with code that has lots of spill-and-fill. This along with
    instructions having access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions)
    if the number of local variables is roughly twice that of the
    number of CPU registers. If more local variables than this, then
    spill/fill rate goes up significantly, and if less, then the
    registers aren't utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is
    instead that the number of local variables be less than the number
    of scratch registers. However, for many/most small leaf functions,
    the total number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one
    has a SP not part of GPRs {including arguments and return values}.
    Once one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log() in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a
    slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely
    in scratch registers. On a 64 GPR machine, this percentage is
    slightly higher (but, not significantly, since there are few leaf
    functions remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...).
    There are a whole lot more leaf functions that exceed a limit of 6
    than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local
    variables (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and
    unrolled and uses lots of variables tending to perform better in my
    case (and tightly looping code, with lots of small functions, not so
    much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;
         All stack-frames are fixed size, VLA's and alloca use the heap;
       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...


    Well, in this case:
    The alloca calls are turned into calls which allocate the memory blob
    and add it to a linked list;
    when the function returns, everything in the linked list is freed;
    Then, it internally pulls this off via malloc and free.
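
    A minimal sketch of that lowering (helper names are hypothetical and
    error handling is omitted; the real BGBCC details may differ):

        #include <stdlib.h>

        typedef struct AllocaBlk { struct AllocaBlk *next; } AllocaBlk;

        /* Each lowered alloca() links its block onto a per-frame list. */
        void *__alloca_hp(AllocaBlk **frame_list, size_t sz)
        {
            AllocaBlk *blk = malloc(sizeof(AllocaBlk) + sz);
            blk->next = *frame_list;
            *frame_list = blk;
            return blk + 1;    /* usable space follows the header */
        }

        /* Emitted in the function epilog: free everything at once. */
        void __alloca_hp_free(AllocaBlk **frame_list)
        {
            AllocaBlk *p = *frame_list, *q;
            for (; p; p = q) { q = p->next; free(p); }
            *frame_list = NULL;
        }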

    Also the typical default stack size in this case is 128K, so trying to
    put big allocations on the stack is more liable to result in a stack
    overflow.

    Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
    heap allocation is not too slow in this case.


    Though, at the same time, ideally one limits use of language features
    where the code-generation degenerates into a mess of hidden runtime
    calls. These cases are not ideal for performance...



    Sometimes alloca is useful wrt offsetting the stack to avoid false
    sharing between stacks. Intel wrote a little paper that addresses this:

    https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

    Remember that one?

    This seems mostly N/A in my case, as the cores use a weak memory model
    and there is no SMT.

    Also thread creation tends to offset stacks by a random amount as well
    as a form of ASLR (IIRC, roughly 0..256 bytes in a multiple of 16).


    As for reducing the cost of heap sharing between threads, there is
    another option here:
    Give each thread its own local version of the heap (essentially,
    per-thread free-lists and similar). This can avoid the need to use mutex locking and similar, though may have a penalty if one tries to free
    memory objects that weren't allocated in the same thread.

    In my case, the heap is split into multiple object sizes:
      Small:
        Under around 1K, allocated in terms of 16-byte cells;
        Allocated in chunks from the medium heap.
      Medium:
        Around 1K to 64K, allocated via subdividing a larger block (256K);
        Allocated via the large heap.
      Large:
        Bigger than 64K or so, allocated via pages (eg: "mmap()").

    For the most common sizes (small and medium), the free-list and similar
    could be thread local; global locking then being used for large
    allocation, or for allocating new heap blocks (for the medium heap).
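
    A small sketch of that size-class split (boundaries as described above,
    names hypothetical):

        #include <stddef.h>

        enum heap_class { HEAP_SMALL, HEAP_MEDIUM, HEAP_LARGE };

        static enum heap_class classify(size_t sz)
        {
            if (sz <= 1024)   return HEAP_SMALL;   /* 16-byte cells           */
            if (sz <= 65536)  return HEAP_MEDIUM;  /* carved from 256K blocks */
            return HEAP_LARGE;                     /* page-granular, mmap()   */
        }

    The small and medium paths would consult a thread-local free list; only
    the large path (and allocating new medium-heap chunks) would take the
    global lock.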

    As can be noted, objects in the small object heap tend to be padded to a
    multiple of 16 bytes, and normally have a 16-byte object header (the
    pointer to an allocated object points just after this header).

    Note that objects in the large heap may instead store this metadata externally.


    Granted, yeah, mutex locking is fairly expensive with a weak memory
    model, and shared memory is generally undesirable as there is little in
    the way of memory coherence (absent explicit flushes, accessing memory belonging to a different thread may result in stale data).


    A similar trick was used in the past for my BGBScript VMs, mostly
    because mutex locking is slow; and dynamic languages like this tend to
    involve a lot of rapid-fire small granularity allocations and frees
    (every object and array goes on the heap).


    In BGBCC, it is merely that large objects and arrays go on the heap
    (along with VLAs and similar). If one creates a lambda, this also goes
    on the heap.

    If one wants to support proper lexical closures (N/A for both C++ and
    Java style lambdas, *), the local variables may also end up on the heap.
    And, if one wanted to support Scheme style call/cc (call-with-current-continuation), the entire ABI frame needs to go on
    the heap. However, a decision was made early on not to bother with
    call/cc support in the BGBCC ABI (an ABI capable of supporting call/cc
    would impose a severe performance penalty).

    There was provision made for supporting exit-only continuations, which
    can effectively leverage the same mechanism as try/catch and throw (the continuation is effectively a self-throwing exception which will be
    caught at a predefined location).


    *: By default, lambdas in BGBCC do not use lexical capture, and instead
    either capture by-value or capture-by-reference (like in C++ style
    lambdas, or GCC inner functions, using C++ style syntax) but differ in
    that the lambdas are callable as normal function pointers (and
    heap-allocated, rather than RAII value-types, though will be auto-freed
    when the originating function returns in the capture-by-reference case).

    In my BGBScript2 language, capture-by-value had been the default, with
    lambdas having an indeterminate lifespan (they may continue to exist
    outside of the scope where the calling function had terminated; unlike
    GCC inner functions).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 10 16:56:51 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:29 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    FWIW, in my case:
    32K I$ + 32K D$ does fairly well IME;

    16K I$ + 32K D$ works well for Doom, but has a notably higher I$ miss
    rate for Quake and similar (and most other non-Doom programs); Doom
    itself seemingly being pretty much D$ bound.


    Constants are generally encoded inline in BJX2.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 23:30:02 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD instruction and 1 or 2 words in DCache, while consuming a GPR. So, overall, it takes
    fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have no
    direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Apr 10 22:18:25 2024
    From Newsgroup: comp.arch

    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    I still feel that this atomicity should somehow be included with
    ESM just because they feel related, but the benefit seems likely
    to be extremely small. How often would software want to copy
    multiple regions atomically or combine region copying with
    ordinary ESM atomicity?? There *might* be some use for an atomic
    region copy and an updating of a separate data structure (moving a
    structure and updating one or a very few pointers??). For
    structures three cache lines in size where only one region
    occupies four cache lines, ordinary ESM could be used.

    My feeling based on "relatedness" is not a strong basis for such
    an architectural design choice.

    (Simple page masking would allow false conflicts when smaller
    memory moves are used. If there is a separate pair of range
    registers that is checked for coherence of memory moves, this
    issue would only apply for multiple memory moves _and_ all eight
    of the buffer entries could be used for smaller accesses.)

    [snip]
    As noted, on a 32 GPR machine, most leaf functions can fit
    entirely in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
    getting totally screwed.

    I wonder how many instructions would have to have access to such a
    set of "special registers" and if a larger number of extra
    registers would be useful. (One of the issues — in my opinion —
    with PowerPC's link register and count register was that they
    could not be directly loaded from or stored to memory [or loaded
    with a constant from the instruction stream]. For counted loops,
    loading the count register from the instruction stream would
    presumably have allowed early branch determination even for deep
    pipelines and small loop counts.) SP, FP, GOT, and TLS hold
    "stable values", which might facilitate some microarchitectural
    optimizations compared to more frequently modified register names.

    (I am intrigued by the possibility of small contexts for some
    multithreaded workloads, similar to how some GPUs allow variable
    context sizes.)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 22:14:33 2024
    From Newsgroup: comp.arch

    On 4/10/2024 4:19 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of
    a basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache
    pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).

    My suggestion is that:: "Now that you have screwed around for a while,
    Why not take that experience and do a new ISA without any of those
    mistakes in it" ??


    There was a reboot, it became BJX2.
    This, of course, has developed some of its own hair...


    Where, BJX1 was a modified SuperH, and BJX2 was a redesigned ISA design
    that was "mostly backwards compatible" at the ASM level.

    Granted, possibly I could have gone further, such as no longer having
    the stack pointer in R15, but alas...


    Though, in some areas, SH had features that I had dropped as well, such
    as auto-increment addressing and delay slots.



    Did have a mess when I later extended the ISA to 32 GPRs, as (like
    with BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    But you did not fix them en masse, and you complain about them
    at least once a week. There comes a time when it takes less time
    and less courage to do that big switch and clean up all that mess.


    For the most part, BJX2 is using 20-bit branches for 32-bit ops.

    Exceptions being the Compare-and-Branch, and Compare-Zero-and-Branch
    ops, but this is mostly because there wasn't enough encoding space to
    give them larger displacements.

    BREQ.Q Rn, Disp11s
    BREQ.Q Rm, Rn, Disp8s

    There are Disp32s variants available, just that these involve using a
    Jumbo prefix.




    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the
    merit of being able to branch anywhere within a Doom or Quake sized
    binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Maybe consider now as the appropriate time to start.


    The above was for SuperH; this sort of thing is N/A for BJX2.

    In this case, BJX2 can pull it off in a single instruction.


    Nonetheless, even with all this crap, the SuperH was still seen as
    sufficient for the Sega 32X/Saturn/Dreamcast (and the Naomi and Hikaru
    arcade machine boards, ...).

    Though, it seems Sega later jumped ship from SuperH to using low-end x86
    PC motherboards in later arcade machines.


    Granted, had I instead started with RISC-V rather than SuperH, it is
    probable BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than
    RV32I or similar. Turns out not really, as the penalty of the 16 bit
    ops was needing almost twice as many on average.

    My 66000 only requires 70% of the instruction count of RISC-V,
    Yours could too ................



    At this point, I suspect the main issue for me not (entirely) beating
    RV64G, is mostly compiler issues...


    So, the ".text" section is still around 10% bigger, with some amount of
    this being spent on Jumbo prefixes, and the rest due to cases where code generation falls short.


    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    You would not have to spend hours a week defending the indefensible !!

    I had generally avoided anything that would have required microcode or
    shoving state-machines into the pipeline or similar.

    Things as simple as IDIV and FDIV require sequencers.
    But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!


    Not so much in my case.

    IDIV and FDIV:
    Feed inputs into Shift-Add unit;
    Stall pipeline for a predefined number of clock cycles;
    Grab result out of the other end (at which point, pipeline resumes).

    In this case, the FDIV was based on noting that if one lets the
    Shift-Add unit run for longer, it moves from doing an integer divide to
    doing a fractional divide, so I could make it perform an FDIV merely by feeding the mantissas into it (as two big integers) and doubling the
    latency. Then glue on some extra logic to figure out the exponents and pack/unpack Binary64, and, done.
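
    A hedged sketch of the underlying trick (restoring shift-subtract
    division; this is my illustration, not the actual hardware): running the
    same loop frac_bits longer turns an integer divider into a fixed-point
    one, which is all a mantissa divide needs:

        #include <stdint.h>

        /* Assumes num < 2*den and den < 2^63 (true for normalized
           mantissas), so the quotient fits: the result carries 1 integer
           bit plus frac_bits fraction bits. */
        static uint64_t div_extended(uint64_t num, uint64_t den, int frac_bits)
        {
            uint64_t rem = num, q = 0;
            for (int i = 0; i <= frac_bits; i++) {
                q <<= 1;
                if (rem >= den) { rem -= den; q |= 1; }
                rem <<= 1;   /* keep going: further bits are fractional */
            }
            return q;
        }

    E.g. div_extended(3, 2, 2) yields 0b110, i.e. 1.5 with 2 fraction bits.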


    Not really the same thing at all...


    Apart from it tending to get stomped every time one does an integer
    divide, it could possibly also be used as an RNG, as it basically churns
    over whatever random bits flow into it from the pipeline.


    Things like Load/Store-Multiple or

    If you like polluted ICaches..............

    For small copies, can encode them inline, but past a certain size
    this becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copy.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an
    instruction.


    This is an area where "slides" work well; the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Consider that the predictor getting into the slide the first time
    always mispredicts !!


    Possibly.

    But, note that the paths headed into the slide are things like structure assignment and "memcpy()" where the size is constant. So, in these
    cases, the compiler already knows where it is branching.

    So, say:
        memcpy(dst, src, 512);
    Gets compiled as, effectively:
        MOV   dst, R4
        MOV   src, R5
        BSR   __memcpy64_512_ua


    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

    What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably, yet a HW sequencer only has to avoid asserting a single byte write enable once.


    Two strategies:
    Compiler pads it to 64 bytes (typical for struct copy, where structs can always be padded up to their natural alignment);
    It emits the code for copying the last N bytes (modulo 32) and then
    branches into the slide (typical for memcpy).
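
    Roughly, for the memcpy case (entry naming per the earlier listing; the
    stand-in body below is my sketch):

        #include <string.h>
        #include <stddef.h>

        /* Stand-in for branching to __memcpy64_<body>_ua, which copies
           body bytes in 32-byte units from the high address downward. */
        static void __memcpy64_slide(void *d, const void *s, size_t body)
        {
            memcpy(d, s, body);
        }

        static void memcpy_via_slide(void *dst, const void *src, size_t n)
        {
            size_t tail = n & 31, body = n - tail;
            /* copy the (n % 32) tail first, then enter the slide */
            memcpy((char *)dst + body, (const char *)src + body, tail);
            __memcpy64_slide(dst, src, body);
        }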


    For variable memcpy, there is an extension:
    _memcpyf(void *dst, void *src, size_t len);

    Which is basically the "I don't care if it copies a little extra"
    version (say, where it may pad the copy up to a multiple of 16 bytes).
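
    A sketch of what that contract permits (illustrative only; the actual
    routine is surely tuned differently):

        #include <string.h>
        #include <stddef.h>

        /* May copy up to 15 bytes beyond len; the caller guarantees both
           buffers have that much slack past the end. */
        void *_memcpyf_sketch(void *dst, const void *src, size_t len)
        {
            size_t n = (len + 15) & ~(size_t)15;   /* round up to 16 */
            for (size_t i = 0; i < n; i += 16)     /* 16B chunk stands in
                                                      for a 128-bit MOV.X */
                memcpy((char *)dst + i, (const char *)src + i, 16);
            return dst;
        }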


    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
    iteration or so to try to limit looping overhead.

    On low end machines, you want to operate at cache port width,
    On high end machines, you want to operate at cache line widths per port.
    This is essentially impossible using slides.....here, the same code is
    not optimal across a line of implementations.


    Possible.

    As is, it uses 64-bit load/store for unaligned copy, and 128-bit for
    aligned copy (support for unaligned "MOV.X" is still an optional feature).

    It mostly doesn't bother trying to sort this out for the slide, as for
    the size ranges dealt with by the slide, trying to separate aligned from unaligned at runtime will end up costing about as much as it saves.


    Though, for larger copies, it makes more sense to figure it out.


    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.

    What do you do when the SATA drive wants to write a whole page ??


    ?...

    Presumably there aren't going to be that many pages being paged out mid-memcpy.


    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
    shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for
    each LZ decoder (as I see it, it is a similar issue to not wanting
    code to endlessly re-roll stuff for functions like memcpy or
    malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
         1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.

    By what, 37-gates ??


    I will assume it is probably a bit more than this given there is not
    currently any sort of mechanism that does anything similar.

    Would need to add some sort of "inject synthesized instructions into the
    pipeline" mechanism; my guess is this would probably be at least a few
    kLUT. Well, unless it is put in ROM, but this would have no real
    advantage over "just do it in software".


    FWIW:
    I had originally intended to put a page-table walker in ROM and then
    pretend like it had a hardware page-walker, but we all know how this
    turned out.

    Though, part of this was because it was competing against arguably more
    useful uses of ROM space, like the FAT driver, PE/COFF and ELF loaders,
    and the boot-time sanity checks (eg: verify early on that I hadn't
    broken fundamental parts of the CPU).



    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the
    last bytes need to be handled externally prior to branching into the
    slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    What if a SATA drive is reading while you are writing !!
    That is, DMA is no different than multi-threaded applications--except
    DMA cannot perform locks.


    Currently there is no DMA, only polling IO.
    Also no SATA interface, nor PCIE, nor ...

    IO to an SDcard is basically probing the MMIO interface and spinning in
    a loop until it is done. Most elaborate part of this interface is that
    there was a mechanism added to allow sending/receiving 8 bytes at a time
    over SPI.


    Main wonk is that it does start copying from the high address first.
    Presumably interrupts or similar won't be messing with application
    memory mid-memcpy.

    The only things wanting high-low access patterns are dumping stuff to
    the stack. The fact you CAN get away with it most of the time is no excuse.


    AFAIK, there is no particular requirement for which direction "memcpy()"
    goes.

    And, high to low was more effective for the copy slide.



    The looping memcpy's generally work from low to high addresses though.

    As does all string processing.

    Granted.

    The string handling functions are their own piles of fun...



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 22:21:33 2024
    From Newsgroup: comp.arch

    On 4/10/2024 9:18 PM, Paul A. Clayton wrote:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.


    As noted, in my case, the whole thing of Ld/St sequences, and memcpy
    slides, mostly applies to constant cases.

    If the copy size is variable, the compiler merely calls "memcpy()",
    which will then generally figure out which loop to use, and one has to
    pay the penalty of the runtime overhead of memcpy needing to figure out
    what it needs to do.
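
    Schematically (thresholds from earlier in the thread; in the real
    compiler the test happens at compile time on the constant size, and the
    helpers here are stand-ins, not actual BGBCC names):

        #include <string.h>
        #include <stddef.h>

        /* Stand-ins for the inline Ld/St sequence and the slide entry: */
        static void __memcpy_inline(void *d, const void *s, size_t n)
            { memcpy(d, s, n); }
        static void __memcpy_slide(void *d, const void *s, size_t n)
            { memcpy(d, s, n); }

        static void memcpy_const(void *d, const void *s, size_t n)
        {
            if      (n <= 128) __memcpy_inline(d, s, n);  /* Ld/St pairs  */
            else if (n <= 512) __memcpy_slide(d, s, n);   /* slide entry  */
            else               memcpy(d, s, n);           /* library call */
        }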


    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a
    slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    I still feel that this atomicity should somehow be included with
    ESM just because they feel related, but the benefit seems likely
    to be extremely small. How often would software want to copy
    multiple regions atomically or combine region copying with
    ordinary ESM atomicity?? There *might* be some use for an atomic
    region copy and an updating of a separate data structure (moving a
    structure and updating one or a very few pointers??). For
    structures three cache lines in size where only one region
    occupies four cache lines, ordinary ESM could be used.

    My feeling based on "relatedness" is not a strong basis for such
    an architectural design choice.

    (Simple page masking would allow false conflicts when smaller
    memory moves are used. If there is a separate pair of range
    registers that is checked for coherence of memory moves, this
    issue would only apply for multiple memory moves _and_ all eight
    of the buffer entries could be used for smaller accesses.)


    All seems a bit complicated to me.

    But, as noted, I went for a model of weak memory coherence and leaving
    most of this stuff for software to sort out.


    [snip]
    As noted, on a 32 GPR machine, most leaf functions can fit entirely
    in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
    getting totally screwed.

    I wonder how many instructions would have to have access to such a
    set of "special registers" and if a larger number of extra
    registers would be useful. (One of the issues — in my opinion —
    with PowerPC's link register and count register was that they
    could not be directly loaded from or stored to memory [or loaded
    with a constant from the instruction stream]. For counted loops,
    loading the count register from the instruction stream would
    presumably have allowed early branch determination even for deep
    pipelines and small loop counts.) SP, FP, GOT, and TLS hold
    "stable values", which might facilitate some microarchitectural
    optimizations compared to more frequently modified register names.

    (I am intrigued by the possibility of small contexts for some
    multithreaded workloads, similar to how some GPUs allow variable context sizes.)

    In my case, yeah, there are two semi-separate register spaces here:
      GPRs: R0..R63
        R0, R1, and R15 are special:
          R0/DLR: Hard-coded register for some instructions;
            Assembler may stomp without warning for pseudo-instructions.
          R1/DHR:
            Was originally intended similar to DLR;
            Now mostly used as an auxiliary link register.
          R15/SP:
            Stack Pointer.
      CRs: C0..C63
        Various special purpose registers;
        Most are privileged only.
        LR, GBR, etc, are in CR space.


    Though, internally, GPRs and CRs both exist within a combined register
    space in the CPU:
      00..3F: Mostly GPR space
      40..7F: CR and SPR space.

    Generally, CRs may only be accessed by certain register ports though.


    By default, the only way to save/restore CRs is by shuffling them
    through GPRs. There is an optional MOV.C instruction for this, but
    generally it is not enabled as it isn't clear that it saves enough to be
    worth the added LUT cost.

    There is a subset version, where MOV.C exists, but is only really able
    to be used with LR and GBR and similar. Generally, this version exists
    as RISC-V Mode needs to be able to save/restore these registers (they
    exist in the GPR space in RISC-V).


    As I can note, if I did a new ISA, most likely the register assignment
    scheme would differ, say:
      R0: ZR / PC
      R1: LR / TP (TBR)
      R2: SP
      R3: GP (GBR)
    Where the interpretation of R0 and R1 would depend on context (ZR and LR
    for most instructions, PC and TP when used as a Ld/St base address).


    Though, some ideas had involved otherwise keeping a similar register
    space layout to my existing ABI, mostly because significant ABI changes
    would not be easy for my compiler as-is.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 11 12:22:47 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    Except it pretty rarely does so (increases icache pressure):

    mov temp_reg, offset const_table
    mov reg,qword ptr [temp_reg+const_offset]

    looks to me like at least 5 bytes for the first instruction and probably
    6 for the second, for a total of 11 (could be as low as 8 for a very
    small offset), all on top of the 8 bytes of dcache needed to hold the
    64-bit value loaded.

    In My 66000 this should be a single 32-bit instruction followed by the
    8-byte const, so 12 bytes total and no lookaside dcache interference.

    It is only when you do a lot of 64-bit data loads, all gathered in a
    single 256-byte buffer holding up to 32 such values, and you can afford
    to allocate a fixed register pointing to the middle of that range, that
    you actually gain some total space: Each load can now just do a

    mov reg,qword ptr [fixed_base_reg+byte_offset]

    which, due to the need for a 64-bit prefix, will probably need 4
    instruction bytes on top of the 8 bytes from dcache. At this point we
    are touching exactly the same number of bytes (12) as My 66000, but from
    two different caches, so much more likely to suffer dcache misses.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Apr 11 14:13:24 2024
    From Newsgroup: comp.arch

    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the
    instruction. Easy to find, easy to access, no register pollution,
    no DCache pollution.

    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!
    Win-win under constraints of Load-Store Arch. Otherwise, it depends.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 11 14:30:27 2024
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.



    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 11 13:35:41 2024
    From Newsgroup: comp.arch

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the
    instruction. Easy to find, easy to access, no register pollution,
    no DCache pollution.

    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.


    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

        MOV   Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
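
    Assuming SHORI shifts the register left 16 and ORs in the immediate (as
    on SH-5), building 0x123456789ABCDEF0 would go:

        MOV   0x1234, Rn    // Rn = 0x0000_0000_0000_1234
        SHORI 0x5678, Rn    // Rn = (Rn<<16)|0x5678 = 0x0000_0000_1234_5678
        SHORI 0x9ABC, Rn    // Rn = 0x0000_1234_5678_9ABC
        SHORI 0xDEF0, Rn    // Rn = 0x1234_5678_9ABC_DEF0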

    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
    1-cycle, is preferable....




    In misc news:

    Some compiler fiddling has now dropped the ".text" overhead (vs RV64G)
    from 10% to 5%.

    This was mostly in the form of adding dependency tracking logic to ASM
    code (albeit in a form where it needs to use ".global" and ".extern" statements for things to work correctly), and no longer giving it a free
    pass.

    This in turn allowed it to effectively cull some parts of the dynamic typesystem runtime and a bunch of the Binary128 support code (shaving
    roughly 14K off of the Doom build).

    Does have a non-zero code impact (mostly in the form of requiring adding ".global" and ".extern" lines to the ASM code in some cases where they
    were absent).


    Looks like a fair chunk of the dynamic types runtime is still present
    though, which appears to be culled in the GCC build (since GCC doesn't
    use the dynamic typesystem at all). Theoretically, Doom should not need
    it, as Doom is entirely "plain old C".

    Main part that ended up culled with this change was seemingly most of
    the code for ex-nihilo objects and similar (which does not seem to be reachable from any of the Doom code).

    There is a printf extension for printing variant types, but this is
    still present in the RV64G build (this would mostly include code needed
    for the "toString" operation). I guess, one could debate whether printf actually needs support for variant types (as can be noted, most normal C
    code will not use it).

    Though, I guess one option could be to modify it to call toString via a function pointer which is only set if other parts of the dynamic
    typesystem are initialized (could potentially save several kB off the
    size of the binary it looks like). Might break stuff though if one tries
    to printf a variant but had not used any types much beyond fixnum and
    flonum, which would not have triggered the typesystem to initialize itself.
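
    A minimal sketch of that lazy hook (all names hypothetical, not BGBCC's
    actual internals):

        #include <stdio.h>
        #include <stdint.h>
        #include <stddef.h>

        typedef uintptr_t variant_t;   /* stand-in for the tagged value */
        typedef int (*variant_tostr_fn)(char *, size_t, variant_t);

        /* Stays NULL unless the dynamic typesystem initializes itself. */
        static variant_tostr_fn __variant_tostr;

        void __register_variant_tostr(variant_tostr_fn fn)
            { __variant_tostr = fn; }

        /* Inside printf's variant handler: */
        static void print_variant(variant_t v)
        {
            char buf[64];
            if (__variant_tostr)
                __variant_tostr(buf, sizeof(buf), v);
            else
                snprintf(buf, sizeof(buf), "#%p", (void *)v); /* fallback */
            fputs(buf, stdout);
        }

    With this, binaries that never touch the typesystem never pull in
    toString, which is where the estimated several-kB savings would come from.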

    Probably doesn't matter too much, as this code is not likely a factor in
    the delta between the ISAs.


    Note that if the size of Doom's ".text" section dropped by another 15K,
    it would reach parity with the RV64G build (which was around 290K in the relevant build ATM; goal being to keep the code fairly close to parity
    in this case, with the differences mostly allowed for ISA specific stuff).

    Though, this is ignoring that roughly 11K of this delta are Jumbo
    prefixes (so the delta in instruction count is now roughly 1.3% at the moment); and RV64G has an additional 24K in its ".rodata" section
    (beyond what could be accounted for in string literals and similar).


    So, in terms of text+rodata (+strtab *), my stuff is smaller at the moment.

    *: Where GCC rolls its string literals into '.rodata', vs BGBCC having a dedicated section for string literals.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 18:46:54 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

        MOV   Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.

    As compared to::

    CALK Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}

    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and 1-cycle, is preferable....

    A consuming instruction where you don't even use a register is better
    still !!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Thu Apr 11 15:42:59 2024
    From Newsgroup: comp.arch

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

        CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding. It could also technically be retrofitted
    onto RISC-V without any significant change, unlike some other options
    (as noted, I don't argue for adding Jumbo prefixes to RV, on the basis
    that there is no real viable way to add them to RV, *).

    Sadly, the closest-to-viable option for RV would be to add the SHORI instruction and optionally pattern-match it in fetch/decode.

    Or, say:
    LUI Xn, Imm20
    ADD Xn, Xn, Imm12
    SHORI Xn, Imm16
    SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
    only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.
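
    As a rough C model of what the fused sequence computes (the "Imm12 is
    positive" caveat exists because the ADD immediate is sign-extended, so
    a negative value would borrow from the LUI part):

        #include <stdint.h>

        int64_t rv_const64(int32_t imm20, int32_t imm12,
                           uint16_t hi16, uint16_t lo16)
        {
            /* LUI+ADD pair, fused: materializes the upper 32 bits */
            int64_t x = (int64_t)(int32_t)(((uint32_t)imm20 << 12) + (uint32_t)imm12);
            /* 2x SHORI, fused in decode to Xn = (Xn << 32) | Imm32 */
            x = (x << 16) | hi16;
            x = (x << 16) | lo16;
            return x;
        }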



    *: To add a jumbo prefix, one needs an encoding that:
    Uses up a really big chunk of encoding space;
    Is otherwise illegal and unused.
    RISC-V doesn't have anything here.


    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal
    encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16 ops
    or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Though, 14x of these spaces would likely be alternate forms of Jumbo
    prefix (with another 14 in unconditional-scalar-op land). No immediate
    need to re-add an equivalent of the 40x2 encoding (from Baseline mode),
    as most of what 40x2 addressed can be encoded natively in XG2 Mode.


    Technically, I also have 2 unused bits in the Imm16 ops as well in XG2
    Mode. I "could" in theory, if I wanted, use them to extend the:
    MOV Imm17s, Rn
    case to:
    MOV Imm19s, Rn
    Though, the other option is to leave them reserved if I later want more
    Imm16 ops.

    For now, current plan is to leave this stuff as reserved.


    An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
    1 cycle is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far only with 33-bit immediate values. Luckily,
    Imm33s seems to address around 99% of uses (for normal ALU ops and similar).

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By itself, though, the difference doesn't seem enough to justify the cost.
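
    For reference, a hedged sketch of expanding one S.E5.F8 lane to Binary32
    (assuming a half-float-style exponent bias of 15, and ignoring the
    zero/Inf/NaN cases for brevity):

        #include <stdint.h>

        float fp14_to_f32(uint16_t lane)    /* S.E5.F8 in the low 14 bits */
        {
            uint32_t s = (uint32_t)((lane >> 13) & 1) << 31;
            uint32_t e = (uint32_t)((((lane >> 8) & 0x1F) - 15 + 127) << 23);
            uint32_t f = (uint32_t)(lane & 0xFF) << 15; /* left-justify fraction */
            union { uint32_t u; float v; } r = { s | e | f };
            return r.v;
        }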

    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
    12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).

    So, alas...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Thu Apr 11 16:25:54 2024
    From Newsgroup: comp.arch

    On 4/11/2024 9:30 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.
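
    Concretely, the case an immediate count would capture (illustrative C;
    the MM operand spelling below is hypothetical):

        typedef struct { double x, y, z; } vec3;

        void copy_vec3(vec3 *dst, const vec3 *src)
        {
            *dst = *src;   /* could lower to:  MM  dst,src,#24
                              with the 24-byte count as a 16-bit immediate */
        }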

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.


    Probably.
    One could argue that, likely, setting up a DMA'ed memmove would be
    expensive enough to make it impractical for small copies (in the
    category where I am using inline Ld/St sequences or slides).

    And, larger copies (where it is most likely to bring benefit) at present mostly seem to be bus/memory bound.



    Sort of reminds me of the thing with the external rasterizer module:
    The module itself draws stuff quickly, but setting it up is still
    expensive enough to limit its benefit. So the main benefit it
    could bring is seemingly just using it to pull off multi-textured
    lightmap rendering, which in this case can run at similar speeds to
    vertex lighting (lightmapped rendering being a somewhat slower option
    for the software rasterizer).

    Well, along with me recently realizing a trick to mimic the look of
    trilinear filtering without increasing the number of texture fetches
    (mostly by distorting the interpolation coords, *). This trick could potentially be added to the rasterizer module.

    *: Traditional bilinear needs 4 texel fetches and 3 lerps (or, a poor
    man's approximation with 3 fetches and 2 lerps). Traditional trilinear
    needs 8 fetches and 7 lerps. The "cheap trick" version only needs the
    same as bilinear.
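
    For concreteness, a minimal C sketch of the traditional bilinear case
    (4 fetches, 3 lerps; single-channel texture, no wrapping or clamping):

        static inline float lerpf(float a, float b, float t)
        { return a + (b - a) * t; }

        float bilinear(const float *tex, int w, float u, float v)
        {
            int   x0 = (int)u,  y0 = (int)v;
            float fx = u - x0,  fy = v - y0;
            float t00 = tex[y0*w + x0],     t10 = tex[y0*w + x0 + 1];     /* fetches 1,2 */
            float t01 = tex[(y0+1)*w + x0], t11 = tex[(y0+1)*w + x0 + 1]; /* fetches 3,4 */
            return lerpf(lerpf(t00, t10, fx),   /* lerp 1 */
                         lerpf(t01, t11, fx),   /* lerp 2 */
                         fy);                   /* lerp 3 */
        }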

    One thing that is still needed is a good, fast, and semi-accurate way to
    pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the
    front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).
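
    Roughly where that reciprocal shows up: u/z and 1/z interpolate linearly
    in screen space, so every pixel pays for one divide (lerpf as in the
    sketch above):

        float persp_u(float u0, float z0, float u1, float z1, float t)
        {
            float u_oz = lerpf(u0 / z0, u1 / z1, t);     /* u/z: linear on screen */
            float o_oz = lerpf(1.0f / z0, 1.0f / z1, t); /* 1/z: linear on screen */
            return u_oz / o_oz;    /* the per-pixel Z = 1.0/Z' step */
        }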

    Granted, this would almost seem to create a need for an OpenGL
    implementation designed around the assumption of a hardware rasterizer
    module rather than software span drawing.


    The rasterizer module also has its own caching, where one sometimes
    needs to signal it to perform a cache flush (such as when updating the contents of a texture, or needing to access the framebuffer for some
    other reason, ...).

    Potentially, the module could be used to copy/transform images in a framebuffer (such as for GUI rendering), but would need to be somewhat generalized for this (such as supporting using non-power-of-2
    raster-images as textures).

    Though, another possibility could be adding a dedicated DMA module, or DMA+Image module, or gluing dedicated DMA and Raster-Copy functionality
    onto the rasterizer module (as a separate thing from its normal "walk
    edges and blend pixels" functionality).




    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...


    Dunno there.

    My stuff doesn't guarantee atomicity in general.

    The only way to ensure that both parties agree on the contents of memory
    is for both to flush their L1 caches or similar.

    Or use "No Cache" memory accesses, which is basically implemented as the
    L1 cache auto-flushing the line as soon as the request finishes; for
    good effect one also needs to add a few NOPs after the memory access to
    be sure the L1 has a chance to auto-flush it. Though, another
    possibility could be to add dedicated non-caching memory access
    instructions.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 23:06:05 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:


    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

        CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding.

    While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
    just simplest. Then, once you start widening the microarchitecture, it
    is better to fetch wider than decode-issue so that you suffer least
    from boundary conditions. Once you start fetching wide OR have wide
    decode-issue, you have ALL the infrastructure to do variable-length
    instructions. Thus, the complaint that VLE is hard has already been
    eradicated.

    It could also technically be retrofitted onto RISC-V without any
    significant change, unlike some other options (as noted, I don't argue
    for adding Jumbo prefixes to RV, on the basis that there is no real
    viable way to add them to RV, *).

    The issue is that once you do VLE, RISC-V's ISA is no longer helping you
    get the job done, especially when you have to execute 40% more
    instructions.

    Sadly, the closest-to-viable option for RV would be to add the SHORI instruction and optionally pattern-match it in fetch/decode.

    Or, say:
    LUI Xn, Imm20
    ADD Xn, Xn, Imm12
    SHORI Xn, Imm16
    SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though probably only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.

    Universal constants gets this down to 0 cycles......

    *: To add a jumbo prefix, one needs an encoding that:
    Uses up a really big chunk of encoding space;
    Is otherwise illegal and unused.
    RISC-V doesn't have anything here.

    Which is WHY you should not jump ship from SH to RV, but jump to an
    ISA without these problems.

    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16 ops
    or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Just another reason not to stay with what you have developed.

    In comparison, I reserve 6 major OpCodes so that a control transfer into
    data is highly likely to get Undefined OpCode exceptions rather than an
    attempt to execute what is in that data. Then, as it is, I still have 21
    slots in the major OpCode group free (27 if you count the permanently
    reserved).

    Much of this comes from side effects of Universal Constants.


    An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
    1 cycle is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far only with 33-bit immediate values. Luckily,
    Imm33s seems to address around 99% of uses (for normal ALU ops and similar).

    What do you do when accessing data that the linker knows is more than 4GB
    away from IP ?? or known to be outside of 0-4GB ?? externs, GOT, PLT, ...

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By itself, though, the difference doesn't seem enough to justify the cost.

    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.

    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
    12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).

    And yet I did.

    So, alas...

    Yes, alas..........
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 23:12:25 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.

    Effectively, that is what HW does: even on the lower-end machines,
    the AGEN unit of the cache-access pipeline is repeatedly cycled,
    and data is read and/or written. One can execute instructions not
    needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
    are in progress.

    Moving this sequencer farther out would still require it to consume
    all L1 BW in any event (snooping) for memory consistency reasons.
    {Note: cache accesses are performed line-wide, not register-width wide}


    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...

    The entire chunk of data traverses the interconnect as a single
    transaction. All interested 3rd parties (neither originator nor
    recipient) see either the memory state before the transfer or
    after the transfer.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Apr 12 02:19:04 2024
    From Newsgroup: comp.arch

    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.

    Maybe. But out of the 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 23:22:16 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/11/2024 9:30 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:


    One thing that is still needed is a good, fast, and semi-accurate way to pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).

    I saw a 10-cycle-latency, 1-cycle-throughput divider at Samsung::
    10 stages of 3-bit at a time SRT divider with some exponent stuff
    on the side. 1.0/z is a lot simpler than that (float only). A lot
    of these great big complicated calculations can be beaten into
    submission with a clever attack of brute force HW.....FMUL and FMAC
    being the most often cited cases.
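
    In SW terms, the same brute-force idea reduces to roughly this (a sketch;
    crude exponent-flip seed, normal positive z only):

        #include <stdint.h>

        float recip_approx(float z)
        {
            union { float f; uint32_t u; } v = { z };
            v.u = 0x7F000000u - v.u;    /* seed: negate the exponent about the bias,
                                           within ~12% of 1/z for normal z */
            float r = v.f;
            r = r * (2.0f - z * r);     /* NR step 1: ~6 good bits  */
            r = r * (2.0f - z * r);     /* NR step 2: ~12 good bits;
                                           a 3rd step reaches full float precision */
            return r;
        }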
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 11 20:07:08 2024
    From Newsgroup: comp.arch

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:


    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit
    constants, and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

         CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at
    least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding.

    While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
    just simplest. Then, once you start widening the microarchitecture, it
    is better to fetch wider than decode-issue so that you suffer least
    from boundary conditions. Once you start fetching wide OR have wide
    decode-issue, you have ALL the infrastructure to do variable-length
    instructions. Thus, the complaint that VLE is hard has already been
    eradicated.


    As noted, BJX2 is effectively VLE.
    Just now split into two sub-variants.

    So, as for lengths:
    Baseline: 16/32/64/96
    XG2: 32/64/96
    Original version was 16/32/48.


    But, the original 48-bit encoding was dropped, mostly to make the rest
    of the encoding more orthogonal, and these were replaced with Jumbo
    prefixes. An encoding space exists where 48-bit ops could in theory be
    re-added to Baseline, but I have not done so, as it does not seem to be
    justifiable in a cost/benefit sense (and they would still have some of
    the same drawbacks as the original 48-bit ops).

    Had also briefly experimented with 24-bit ops, but these were quickly
    dropped due to "general suckage" (though, an alternate 16/24/32/48
    encoding scheme could have theoretically given better code-density).


    However, RISC-V is either 32-bit, or 16/32.

    For now, I am not bothering with the 16-bit C extension, not so much for
    the sake of difficulty of dealing with VLE (the core can already deal with
    VLE), but more because the 'C' encodings are such a dog-chewed mess that
    I don't feel terribly inclined to bother with them.


    But, like, I can't really compare BJX2 Baseline with RV64G in terms of
    code density, because this wouldn't be a fair comparison. Would need to compare code-density between Baseline and RV64GC, which would imply
    needing to actually support the C extension.

    I could already claim a "win" here if I wanted, but as I see it, doing
    so would not be valid.


    Theoretically, encoding space exists for bigger ops in RISC-V, but no
    one has defined ops there yet as far as I know. Also, the way RISC-V represents larger ops is very different.

    However, comparing fixed-length against VLE when the VLE only has larger
    instructions is still acceptable as I see it (even if larger
    instructions can still allow a more compact encoding in some cases).


    Say, for example, as I see it, SuperH vs Thumb2 would still be a fair comparison, as would Thumb2 vs RV32GC, but Thumb2 vs RV32G would not.

    Unless one only cares about "absolute code density" irrespective of
    keeping parity in terms of feature-set.


    It could also technically be retrofitted
    onto RISC-V without any significant change, unlike some other options
    (as noted, I don't argue for adding Jumbo prefixes to RV, on the
    basis that there is no real viable way to add them to RV, *).

    The issue is that once you do VLE, RISC-V's ISA is no longer helping you
    get the job done, especially when you have to execute 40% more
    instructions.


    Yeah.

    As noted, I had already been beating RISC-V in terms of performance,
    only there was a shortfall in terms of ".text" size (for the XG2 variant).


    Initially this was around a 16% delta, now down to around 5%. Nearly all
    of the size reduction thus far, has been due to fiddling with stuff in
    my compiler.

    In theory, BJX2 (XG2) should be able to win in terms of code-density, as
    the only cases where RISC-V has an advantage do not appear to be
    statistically significant.


    As also noted, I am using "-ffunction-sections" and similar (to allow
    GCC to prune unreachable functions), otherwise there is "no contest"
    (easier to win against 540K than 290K...).


    Sadly, the closest-to-viable option for RV would be to add the SHORI
    instruction and optionally pattern-match it in fetch/decode.

    Or, say:
       LUI Xn, Imm20
       ADD Xn, Xn, Imm12
       SHORI Xn, Imm16
       SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though
    probably only if the Imm12 is positive), and 2x SHORI into a combined
    "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.

    Universal constants gets this down to 0 cycles......


    Possibly.


    *: To add a jumbo prefix, one needs an encoding that:
       Uses up a really big chunk of encoding space;
       Is otherwise illegal and unused.
    RISC-V doesn't have anything here.

    Which is WHY you should not jump ship from SH to RV, but jump to an
    ISA without these problems.


    Of the options that were available at the time:
    SuperH: Simple encoding and decent code density;
    RISC-V: Seemed like it would have had worse code density.
    Though, it seems that RV beats SH in this area.
    Thumb: Uglier encoding and some more awkward limitations vs SH.
    Also, condition codes, etc.
    Thumb2: Was still patent-encumbered at the time.
    PowerPC: Bleh.
    ...


    The main reason for RISC-V support is not due to "betterness", but
    rather because RISC-V is at least semi-popular (and not as bad as I
    initially thought, in retrospect).


    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal
    encoding space mostly because if I put instructions in there (with the
    existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16
    ops or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Just another reason not to stay with what you have developed.

    In comparison, I reserve 6 major OpCodes so that a control transfer into
    data is highly likely to get Undefined OpCode exceptions rather than an
    attempt to execute what is in that data. Then, as it is, I still have 21
    slots in the major OpCode group free (27 if you count the permanently
    reserved).

    Much of this comes from side effects of Universal Constants.


    An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
    1 cycle is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far only with 33-bit immediate values. Luckily,
    Imm33s seems to address around 99% of uses (for normal ALU ops and similar).

    What do you do when accessing data that the linker knows is more than
    4GB away from IP ?? or known to be outside of 0-4GB ?? externs, GOT,
    PLT, ...

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case.
    By itself, though, the difference doesn't seem enough to justify
    the cost.

    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.



    I do at least have some confidence that my stuff can be made usable on affordable FPGAs.

    Some of the stuff you argue for, I don't feel is viable on this class of hardware.


    Like, the challenge would be to, say, make a soft-processor and fit all
    of the stuff you are arguing for into an XC7S50 or similar (say, on one
    of the Arty boards or something).

    Or, some other sub $400 or so FPGA board (that can be targeted with the
    free version of Vivado or similar...).
    Something like a Lattice ECP5 is probably OK.

    Though, Cyclone-V or Zynq is probably not; too much room for "cheating"
    there by leveraging the ARM cores...




    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64
    in 12 bytes (and allowing a 16-byte encoding would have too steep of a
    cost increase to be worthwhile).

    And yet I did.


    I am not saying it is impossible, only that I can't pull it off with my existing encoding.


    I guess it could be possible if I burnt all of the remaining encoding
    bits on it (effectively 27-bit jumbo prefixes, + the WI bit in the final
    instruction).

    This would preclude using these bits for anything else though.
    Debatable if it is "worth it".



    So, alas...

    Yes, alas..........

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 12 01:40:27 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    std #4614300636657501161,[sp,88] // a[0]
    std #4645348406721991307,[sp,104] // a[2]
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7
    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11
    ; %bb.3:
    fmul r3,r1,r1
    fdiv r3,#1,r3
    mov r4,#0x3F90B4FB18B485C7 // p[5]
    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]
    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header: Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1
    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail
    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header: Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function
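
    For reference, a C paraphrase of what the LBB141_10 common tail computes
    (the usual Cody-style erf split: x is broken at 1/16 granularity so that
    exp(-x*x) comes from two well-conditioned exponentials; the pflt sign
    fixup is omitted):

        #include <math.h>

        double common_tail(double x, double ratio)  /* ratio from the poly code */
        {
            double xn  = (double)(long)(x * 16.0) / 16.0; /* fmul, fmul, cvtds, cvtsd */
            double del = (x - xn) * (x + xn);
            double r   = exp(-xn * xn) * exp(-del);       /* the two fexp ops */
            return (0.5 - r * ratio) + 0.5;               /* the fmac/fadd pair */
        }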
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Apr 12 13:40:01 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.

    Maybe. But out of the 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Apr 12 18:08:33 2024
    From Newsgroup: comp.arch

    On Fri, 12 Apr 2024 13:40:01 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.

    Maybe. But out of the 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.


    ARM LDADD is not a LD-OP instruction. It is RMW.
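
    That is, LDADD is an atomic read-modify-write on the memory location,
    equivalent to C11's atomic_fetch_add, not a form that feeds memory into
    an ALU op's source operand:

        #include <stdatomic.h>

        int fetch_add(atomic_int *p, int v)
        {
            return atomic_fetch_add(p, v);  /* returns old value; memory += v */
        }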


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 12 13:12:28 2024
    From Newsgroup: comp.arch

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    The patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants; in particular, ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar are rare).


    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    ADD -128, SP
    std #4614300636657501161,[sp,88] // a[0]
    MOV 0x400949FB3ED443E9, R3
    MOV.Q R3, (SP, 88)
    std #4645348406721991307,[sp,104] // a[2]
    MOV 0x407797C38897528B, R3
    MOV.Q R3, (SP, 104)
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    ... pattern is obvious enough.
    Each constant needs 12 bytes, plus a 4-byte store, so 16 bytes per store.

    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    FABS R5, R6
    FLDH 0x3780, R3 //A
    FCMPGT R3, R6 //A
    BT .LBB141_6 //A

    Or (FP-IMM extension):

    FABS R5, R6
    FCMPGE 0x0DE, R6 //B (FP-IMM)
    BF .LBB141_6 //B

    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7

    FCMPGE 0x110, R6
    BF .LBB141_7

    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11

    MOV 0x403A8B020C49BA5E, R3
    FCMPGT R3, R6
    BT .LBB141_11

    Where FP-IMM won't work with that value.


    ; %bb.3:
    fmul r3,r1,r1
    FMUL R5, R5, R7
    fdiv r3,#1,r3
    Skip, operation gives identity?...

    mov r4,#0x3F90B4FB18B485C7 // p[5]
    Similar.

    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]

    Turns into 4 constants and 7 FPU instructions (or 4 with the FMAC
    extension). Though, at present, FMAC is slower than separate FMUL+FADD.

    So, between 8 and 11 instructions.


    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]

    These can map 1:1.

    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header:
    Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1

    Could be mapped to a scalar loop, pretty close to 1:1.

    Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
    guessing 2 copies for a 4-element vector?...


    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail

    Same patterns as before.
    Would need ~ 10 ops.

    Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
    but this would only give "Binary32 truncated to 29 bits" precision for
    the immediate values.

    Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
    give ~ 53 bits of precision. But, if one needs full Binary64, this will
    not gain much in this case.
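
    A hedged sketch of the 29-bit FP-IMM expansion (layout assumed to be
    S.E8.F20, i.e. a Binary32 with the low 3 fraction bits dropped;
    zero/Inf/NaN handling omitted):

        #include <stdint.h>

        double fpimm29_to_f64(uint32_t imm29)
        {
            uint64_t s = (uint64_t)((imm29 >> 28) & 1) << 63;
            uint64_t e = (uint64_t)(((imm29 >> 20) & 0xFF) - 127 + 1023) << 52;
            uint64_t f = (uint64_t)(imm29 & 0xFFFFF) << 32; /* left-justify */
            union { uint64_t u; double d; } r = { s | e | f };
            return r.d;
        }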


    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header:
    Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    Don't really have time at the moment to comment on the rest of this...


    In other news, found a bug in the function dependency-walking code.

    Fixing this bug got things a little closer to break-even with RV64G GCC
    output regarding ".text" size (though it was still not sufficient to
    entirely close the gap).


    This was mostly based on noting that the compiler output had included
    some things that were not reachable from within the program being
    compiled (namely, that the Doom build had included a copy of the
    MS-CRAM video decoder and similar, which was not reachable from
    anywhere within Doom).
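
    The shape of the fix, illustratively (not the actual compiler code):

        #include <stddef.h>

        typedef struct Func Func;
        struct Func { int reached; size_t ncalls; Func **calls; };

        static void mark_reachable(Func *f)
        {
            if (!f || f->reached) return;
            f->reached = 1;
            for (size_t i = 0; i < f->ncalls; i++)
                mark_reachable(f->calls[i]); /* walk the direct-call graph */
        }
        /* The emit pass then skips any Func with reached == 0, e.g. the
           stray MS-CRAM decoder noted above. */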

    Some more analysis may be needed.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 12 23:46:33 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    The patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants; in particular, ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar are rare).


    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    ADD -128, SP
    std #4614300636657501161,[sp,88] // a[0]
    MOV 0x400949FB3ED443E9, R3
    MOV.Q R3, (SP, 88)
    std #4645348406721991307,[sp,104] // a[2]
    MOV 0x407797C38897528B, R3
    MOV.Q R3, (SP, 104)
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    ... pattern is obvious enough.
    Each constant needs 12 bytes, plus a 4-byte store, so 16 bytes per store.

    But 2 instructions instead of 1 and 16 bytes instead of 12.

    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    FABS R5, R6
    FLDH 0x3780, R3 //A
    FCMPGT R3, R6 //A
    BT .LBB141_6 //A

    Or (FP-IMM extension):

    FABS R5, R6
    FCMPGE 0x0DE, R6 //B (FP-IMM)
    BF .LBB141_6 //B

    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7

    FCMPGE 0x110, R6
    BF .LBB141_7

    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11

    MOV 0x403A8B020C49BA5E, R3
    FCMPGT R3, R6
    BT .LBB141_11

    Where FP-IMM won't work with that value.

    Value came from source code.

    ; %bb.3:
    fmul r3,r1,r1
    FMUL R5, R5, R7
    fdiv r3,#1,r3
    Skip, operation gives identity?...

    It is a reciprocal:: R3 = #1.0/R3

    mov r4,#0x3F90B4FB18B485C7 // p[5]
    Similar.

    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]

    Turns into 4 constants and 7 FPU instructions (or 4 with the FMAC
    extension). Though, at present, FMAC is slower than separate FMUL+FADD.

    So, between 8 and 11 instructions.

    Instead of 4.....

    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]

    These can map 1:1.

    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header:
    Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1

    Could be mapped to a scalar loop, pretty close to 1:1.

    I have 7 instructions in the loop, you would have 9.

    Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
    guessing 2 copies for a 4-element vector?...


    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail

    Same patterns as before.
    Would need ~ 10 ops.

    Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
    but this would only give "Binary32 truncated to 29 bits" precision for
    the immediate values.

    Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
    give ~ 53 bits of precision. But, if one needs full Binary64, this will
    not gain much in this case.


    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header:
    Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    Don't really have time at the moment to comment on the rest of this...


    In other news, found a bug in the function dependency-walking code.

    Fixing this bug got things a little closer to break-even with RV64G GCC
    output regarding ".text" size (though it was still not sufficient to
    entirely close the gap).


    This was mostly based on noting that the compiler output had included
    some things that were not reachable from within the program being
    compiled (namely, noticing that the Doom build had included a copy of
    the MS-CRAM video decoder and similar, which was not reachable from
    anywhere within Doom).

    Some more analysis may be needed.

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 13 03:17:43 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output,
    though (mostly in the FP constants; constants that fall outside the
    scope of what can be exactly represented as Binary16 or similar are
    rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 13 01:12:53 2024
    From Newsgroup: comp.arch

    On 4/12/2024 10:17 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    <snip: full r8_erf listing, quoted in the previous message>

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output,
    though (mostly in the FP constants; constants that fall outside the
    scope of what can be exactly represented as Binary16 or similar are
    rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}


    Some stats I have (for GLQuake):
    14.9% of constants are floating-point.
    10.99% are FP and can be expressed exactly as Binary16.
    7.3% as Fp5 (E3.F2)
    9.5% as Fp10 (S.E5.F6)
    1.3% can be expressed in Binary32
    2.7% need Binary64.

    If scaled so that this is only FP constants:
    73.0% are Binary16
    8.7% are Binary32
    18.1% are Binary64



    Granted, this is inexact, as the stat is based on pattern recognition
    rather than type. However, given that for Doom the total percentage of constants flagged as FP drops to around 1%, probably not too far off.
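
    For reference, that sort of pattern test can be done with round-trips
    and a few bit checks (a sketch, not BGBCC's actual classifier):

        #include <stdint.h>
        #include <string.h>

        // Exactly representable as Binary32: the float round-trip is lossless.
        static int fits_binary32(double d) {
            float f = (float)d;
            return (double)f == d;
        }

        // Exactly representable as (normal or zero) Binary16: must fit
        // Binary32, use <= 10 fraction bits, and have exponent in [-14,15].
        // Subnormals, infinities, and NaNs are ignored for brevity.
        static int fits_binary16(double d) {
            float f = (float)d;
            if ((double)f != d) return 0;
            if (f == 0.0f) return 1;
            uint32_t u;
            memcpy(&u, &f, sizeof u);
            int exp = (int)((u >> 23) & 0xFF) - 127;
            if (exp < -14 || exp > 15) return 0;
            return (u & 0x1FFF) == 0;   // low 13 of 23 fraction bits unused
        }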


    So, here, it seems it is common enough that the ability to load it into
    a register in 1 cycle is worthwhile, but not so much that I am all that
    worried about needing to spend an instruction to do so.

    More so, when the 1 cycle spent on the constant load is overshadowed by
    the 6 cycles it takes to do a Binary64 FADD or FMUL (faster only exists
    for low-precision ops, or for Binary16/Binary32 SIMD).


    Can also note, for integer immediate values:
      3RI Imm9un: 97% hit-rate
        2% turn into Jumbo Imm33s
        1% require a separate constant
      2RI Imm10un: 94% hit-rate
        4.4% turn into 2RI Imm16
        1.5% turn into Jumbo Imm33s
        0.1% require a separate constant
      Ld/St Disp9u: 96.4% hit-rate
        0.18% are negative
        3.42% turn into Jumbo Disp33s


    For RISC-V, the Imm12s case does result in a better hit rate for the
    basic instructions, albeit the fallback case is worse (LUI+ADD or a
    memory load).

    Whereas, in my case, it is more a question of whether it ends up better
    to load the immediate into a register or to use a jumbo prefix (where
    the compiler may look forward and make a guess).
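
    A toy version of that decision (my own naming and cut-offs, not
    BGBCC's): classify each integer immediate by the cheapest form that
    can hold it, along the lines of the cases in the stats above:

        #include <stdint.h>

        typedef enum { IMM9U, IMM16U, IMM33S, CONST_LOAD } ImmForm;

        static ImmForm pick_imm_form(int64_t v) {
            if (v >= 0 && v < (1 << 9))               return IMM9U;   // base 3RI form
            if (v >= 0 && v < (1 << 16))              return IMM16U;  // 2RI Imm16 (assumed unsigned)
            if (v >= -(1LL << 32) && v < (1LL << 32)) return IMM33S;  // jumbo-prefixed
            return CONST_LOAD;  // load a separate constant into a register
        }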



    Even in a world where I could directly store constants to memory, or
    glue full 64-bit constants onto any instruction, it doesn't seem likely
    that this would have all that large an impact on either program size or
    performance.





    Though, at the moment (in ongoing compiler fiddling), I am not seeing
    much more evidence of unrelated / unreachable code being included in the binary. Seems like this optimization case may be used up.

    This leaves roughly another 4% remaining...

    I guess, will see how much of a fight this last 4% puts up...


    Though, looks like I could in theory shave several kB off mostly by
    disabling the memcpy slide and making the limit for inline memcpy
    smaller and similar, but this comes at a performance impact (needs to go
    the slower route of "actually calling memcpy()" in more cases...). Will continue to look for other options.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Sat Apr 13 12:51:18 2024
    From Newsgroup: comp.arch

    On 4/9/24 13:24, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.
    If you want to know more about MCore, you can contact me.
    I was the initial designer of the mcore ISA. It was targeted
    at embedded processors, particularly control processors in phones
    and radios. It was extended and found its way into GPS receivers and
    set top boxes. Motorola licensed it to the Chinese and there it is
    known as CSky ISAv1 (there is a different ISAv2). There is even a
    supported Linux port of CSky v1.

    brian

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 14 22:58:22 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    <snip>

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000 architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running process.

    I had missed this until now:: The stack remains 64-bit aligned at all times,
    so if you add 32-bits to the stack you actually add 64-bits to the stack.

    Given this, you can effectively use a 2-bit tag {integral, floating, pointing, describing}. The difference between pointing and describing is that pointing
    is C-like, while describing is dope-vector-like. {{Although others may find something else to put in the 4-th slot.}}

    Any comments are welcome.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 14 23:25:52 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
    {
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
    {c, sum[i]} = a[i] + b[i] + c; // pseudocode: 65-bit sum, carry back into c
    }
    return;
    }

    Assembly code::

    .global mpn_add_n
    mpn_add_n:
    MOV R5,#0 // c
    MOV R6,#0 // i

    VEC R7,{}
    LDD R8,[R2,Ri<<3]
    LDD R9,[R3,Ri<<3]
    CARRY R5,{{IO}}
    ADD R10,R8,R9
    STD R10,[R1,Ri<<3]
    LOOP LT,R6,#1,R4
    RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a
    performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does
    CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.
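
    The {c, sum[i]} line in the source is pseudocode for a 65-bit sum; in
    plain C (a sketch, not GMP's actual mpn_add_n) the same carry chain
    would look like:

        #include <stdint.h>

        void mpn_add_n_c(uint64_t *sum, const uint64_t *a,
                         const uint64_t *b, int n)
        {
            uint64_t c = 0;
            for (int i = 0; i < n; i++) {
                uint64_t t = a[i] + c;     // wraps only when c=1 and a[i]=~0
                uint64_t c1 = (t < c);
                sum[i] = t + b[i];
                c = c1 + (sum[i] < t);     // total carry out is 0 or 1
            }
        }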
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 15 10:02:46 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
    {
        uint64_t c = 0;
        for( int i = 0; i < n; i++ )
        {
             {c, sum[i]} = a[i] + b[i] + c; // pseudocode: 65-bit sum, carry back into c
        }
        return;
    }

    Assembly code::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]
        LDD   R9,[R3,Ri<<3]
        CARRY R5,{{IO}}
        ADD   R10,R8,R9
        STD   R10,[R1,Ri<<3]
        LOOP  LT,R6,#1,R4
        RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.
    ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
    xor rax,rax ;; Clear carry
    next:
    mov rax,[rsi+rcx*8]
    adc rax,[rdx+rcx*8]
    mov [rdi+rcx*8],rax
    inc rcx
    jnz next
    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.
    In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:
    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle
    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle
    inc ecx
    jnz next ; Third cycle
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 15 11:16:15 2024
    From Newsgroup: comp.arch

    Terje Mathisen wrote:
    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    <snip: quoted mpn_add_n source, My 66000 assembly, and commentary>

      ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
      xor rax,rax ;; Clear carry
    next:
      mov rax,[rsi+rcx*8]
      adc rax,[rdx+rcx*8]
      mov [rdi+rcx*8],rax
      inc rcx
       jnz next

    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.

    In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

     next:
      adc eax,ebx
      mov ebx,[edx+ecx*4]    ; First cycle

      mov [edi+ecx*4],eax
      mov eax,[esi+ecx*4]    ; Second cycle

      inc ecx
       jnz next        ; Third cycle

    In the same bad old days, the standard way to speed it up would have
    used unrolling, but until we got more registers, it would have stopped
    itself very quickly. With AVX2 we could use 4 64-bit slots in a 32-byte register, but then we would have needed to handle the carry propagation manually, and that would take longer than a series of ADC/ADX instructions. next4:
    mov eax,[esi]
    adc eax,[esi+edx]
    mov [esi+edi],eax
    mov eax,[esi+4]
    adc eax,[esi+edx+4]
    mov [esi+edi+4],eax
    mov eax,[esi+8]
    adc eax,[esi+edx+8]
    mov [esi+edi+8],eax
    mov eax,[esi+12]
    adc eax,[esi+edx+12]
    mov [esi+edi+12],eax
    lea esi,[esi+16]
    dec ecx
    jnz next4
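
    For reference, the same carry chain can also be written with the x86
    _addcarry_u64 intrinsic (a sketch, not Terje's code); compilers
    typically turn this into an ADC loop, and ADX-aware ones can split
    independent chains:

        #include <stdint.h>
        #include <immintrin.h>

        static void add_n_adc(uint64_t *sum, const uint64_t *a,
                              const uint64_t *b, long n)
        {
            unsigned char c = 0;
            for (long i = 0; i < n; i++) {
                unsigned long long t;
                c = _addcarry_u64(c, a[i], b[i], &t);
                sum[i] = t;
            }
        }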
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 15 19:03:34 2024
    From Newsgroup: comp.arch

    Michael S wrote:

    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.


    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen a LD-OP architecture that had a SUBR instruction. Maybe TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.

    That a SUBR instruction exists does not disavow my statement that
    the inbound memory reference was never in the Rs1 position.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 15 20:55:53 2024
    From Newsgroup: comp.arch

    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle

    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle

    inc ecx
    jnz next ; Third cycle

    Terje

    As opposed to::

    .global mpn_add_n
    mpn_add_n:
    MOV R5,#0 // c
    MOV R6,#0 // i

    VEC R7,{}
    LDD R8,[R2,Ri<<3] // Load 128-to-512 bits
    LDD R9,[R3,Ri<<3] // Load 128-to-512 bits
    CARRY R5,{{IO}}
    ADD R10,R8,R9 // Add pair to add octal
    STD R10,[R1,Ri<<3] // Store 128-to-512 bits
    LOOP LT,R6,#1,R4 // increment 2-to-8 times
    RET

    --------------------------------------------------------

    LDD R8,[R2,Ri<<3] // AGEN cycle 1
    LDD R9,[R3,Ri<<3] // AGEN cycle 2 data cycle 4
    CARRY R5,{{IO}}
    ADD R10,R8,R9 // cycle 4
    STD R10,[R1,Ri<<3] // AGEN cycle 3 write cycle 5
    LOOP LT,R6,#1,R4 // cycle 3

    OR

    LDD LDd
    LDD LDd
    ADD
    ST STd
    LOOP
    LDD LDd
    LDD LDd
    ADD
    ST STd
    LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !! without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Mon Apr 15 16:14:24 2024
    From Newsgroup: comp.arch

    On 4/12/2024 10:17 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    <snip: full r8_erf listing, quoted in full earlier in the thread>

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output,
    though (mostly in the FP constants; constants that fall outside the
    scope of what can be exactly represented as Binary16 or similar are
    rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}

    FWIW:
    I went and was able to find some odd corners to fit full Binary64
    encodings into, but currently only in XG2 Mode (there will be
    equivalents in Baseline, albeit effectively truncated to 56 bits).

    So, went and burned the remaining bits in the Jumbo prefixes in XG2
    Mode, which now allows:
    ADD/SUB/AND/OR/XOR, and FADD/FMUL with full 64-bit immediate values.

    Ended up going and burning the MULS.L/MULU.L and ADDS.L/ADDU.L encodings
    on FADD and FMUL, so:
    FEii-iiii FEii-iiii F2nm-2Gii FMUL Rm, Imm56f, Rn
    FEii-iiii FEii-iiii F2nm-3Gii FADD Rm, Imm56f, Rn

    With Imm56f encoded like Imm57s (extended to 64-bit), but:
    Imm(55: 0) goes into (63: 8)
    Imm(62:56) goes into ( 7: 1)
    Imm( 63) goes into ( 0)
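
    In C terms, my reading of that bit shuffle (a hypothetical helper, not
    BGBCC's code):

        #include <stdint.h>

        // Reconstruct the 64-bit FP value from the Imm56f field.
        static uint64_t imm56f_decode(uint64_t imm) {
            uint64_t v = 0;
            v |= (imm & 0x00FFFFFFFFFFFFFFull) << 8;  // Imm(55: 0) -> (63:8)
            v |= ((imm >> 56) & 0x7F) << 1;           // Imm(62:56) -> ( 7:1)
            v |= (imm >> 63) & 1;                     // Imm( 63)   -> ( 0)
            return v;
        }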

    Effects on code-size and performance: Minimal.
    Saves roughly 600 bytes from GLQuake;
    No real effect on the size of Doom.



    As for my fiddling with getting ".text" size down (testing mostly with
    Doom):
    XG2 : 293K
    RV64G: 285K

    Of this, there is still effectively 10K in Jumbo prefixes, and RV64G has
    an additional 24K in ".rodata".

    This implies, at the moment, XG2 instruction count is ~ 1% less than
    RV64G, albeit text slightly larger due to average instruction size being slightly bigger (~ 4.14 bytes).


    Also noted that the BGBCC compiler output has ~ 40K worth of MOV RegReg instructions, so MOV RegReg is around 14% of the total size of the
    binary (I suspect a big chunk of the relative size reduction of Baseline
    may be due to having a 16-bit encoding for this).

    So, one ongoing goal on the compiler optimization front would be
    continuing to work towards a reduction in the number of MOV instructions.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Apr 16 08:44:26 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    <snip: re-quoted x86 loop, My 66000 mpn_add_n assembly, and pipeline diagram>

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
    machine !!
    without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
    machine !!
    It all comes down to the carry propagation, right?
    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block to the next, right?
    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 16 18:14:39 2024
    From Newsgroup: comp.arch

    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    <snip: re-quoted loop code and pipeline diagram>

    It all comes down to the carry propagation, right?

    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block
    to the next, right?

    Most ST pipelines have an align stage to align the data being stored to
    where it needs to go; one can extend the carry chain into this stage if
    needed, capturing both a+b and a+b+1 and using the carry in to select
    one or the other.
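
    In C terms, the select trick looks roughly like this (a sketch of the
    technique, not the actual pipeline logic):

        #include <stdint.h>

        // Compute both a+b and a+b+1 early; the late carry-in then only
        // drives a select, staying off the adder's critical path.
        static uint64_t add_select(uint64_t a, uint64_t b,
                                   int cin, int *cout)
        {
            uint64_t s0 = a + b;                 // assumes carry-in = 0
            uint64_t s1 = a + b + 1;             // assumes carry-in = 1
            *cout = cin ? (s1 <= a) : (s0 < a);  // wrap => carry out
            return cin ? s1 : s0;
        }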

    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!

    Terje
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 15:02:13 2024
    From Newsgroup: comp.arch

    On 4/3/2024 10:24 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the disassembler).

    Good point. It probably isn't too bad for the arithmetic operations,
    etc, but once you extend it as I suggested in the last paragraph it gets
    ugly. :-(


    big snip

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use?

    In in OoO CPU, that's pretty heavy.

    OK, but in the vast majority of cases (i.e. unless there is something
    like a conditional branch that uses floating point or integer depending
    upon whether the branch is taken.) the flag bit that a register will
    have can be known well in advance. As I said, IANAHG, but that might
    make it easier.



    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT),

    But even here, you almost certainly know what the tag bit for any given register is long before you execute the EXIT instruction. And remember,
    on My 66000 EXIT is performed lazily, so you have time, and the mechanism
    is in place to wait if needed.


    so you
    probably can handle the tags in the front end (decoder and renamer).
    Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.

    Yes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 15:06:48 2024
    From Newsgroup: comp.arch

    On 4/3/2024 11:44 AM, EricP wrote:
    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, >> it has several features that are “friendly” to the idea.  Second, I
    know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISA's
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded
    value to double precision.

    big snip

    Currently the opcode data type can tell the uArch how to route
    the  operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Seems right.



    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.

    Could be. Again, IANAHG.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 15:08:58 2024
    From Newsgroup: comp.arch

    On 4/3/2024 1:02 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions. Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    I think that is probably true for 32 bit instructions, but what about 16
    bit?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Apr 16 17:46:00 2024
    From Newsgroup: comp.arch

    On 4/16/2024 5:08 PM, Stephen Fuld wrote:
    On 4/3/2024 1:02 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If >>> set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions.  Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    I think that is probably true for 32 bit instructions, but what about 16 bit?


    At least, as I see it...


    If 4-bit registers:
    16-4-4 => 8 opcode bits
    If 5-bit registers:
    16-5-5 => 6 opcode bits

    Realistically, I don't think 6 bits of opcode is enough except if the
    purpose of the 16-bit ops is merely to shorten some common 32-bit ops.

    But, a subset of instructions can use 5-bit fields (say, MOV, EXTS.L,
    and common Load/Store ops).

    Say (in my notation):
    MOV Rm, Rn
    EXTS.L Rm, Rn
    MOV.L (SP, Disp), Rn
    MOV.Q (SP, Disp), Rn
    MOV.X (SP, Disp), Xn
    MOV.L Rn, (SP, Disp)
    MOV.Q Rn, (SP, Disp)
    MOV.X Xn, (SP, Disp)
    As, these tend to be some of the most commonly used instructions.

    For most everything else, one can limit things either to the first 16 registers, or the most commonly used 16 registers (if not equivalent to
    the first 16).

    Though, for 1R ops, it can make sense to have 5-bit registers.


    I don't really think 3-bit register fields are worth bothering with;
    even if limited to the most common registers. Granted, being limited to
    2R encodings is also limiting.

    Granted, both Thumb and RVC apparently thought 3-bit register fields
    were worthwhile, so...

    Similarly, not worth bothering (at all) with 6-bit register fields in
    16-bit ops.


    Though, if one has 16-bit VLE, a question is how is best to split up 16
    vs 32-bit encoding space.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 16:44:27 2024
    From Newsgroup: comp.arch

    On 4/3/2024 2:30 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s >>> use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
                   66000
    Sorry. Typo.



    has several features that are “friendly” to the idea.  Second, I know >>> Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any >>> special "store tag" instructions.

    What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa?

    As I said below, if you need that, you can use an otherwise "useless" instruction, such as ORing a register with itself, to modify the tag bits.





    When executing arithmetic instructions, if the tag bits of both
    sources of an instruction are the same, do the appropriate operation
    (floating or integer), and set the tag bit of the result register
    appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

    Good point.




    3.    Always do the operation in floating point and convert the
    integer operand prior to the operation.  (Or, if you prefer, change
    floating point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is
    the best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense
    for FP operands, e.g. the transcendental instructions.  And there are
    some operations that probably only make sense for non-FP operands,
    e.g. POP, FF1, probably shifts.  Given the tag bit, these could share
    the same op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    Agreed.


    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if
    FP and integers share a register file, the lack of a prototype is much
    less stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter
    and Exit instructions to save/restore the tag bits of the registers
    they are saving or restoring in the same data structure it uses for
    the registers (yes, it adds 32 bits to that structure – minimal
    cost).  The same mechanism works for interrupts that take control
    away from a running process.

    Yes, but we do just fine without the tag and without the stuff mentioned
    above. Neither ENTER nor EXIT cares about the 64-bit pattern in the
    register.

    I think you need it for callee-saved registers, to ensure the tag is
    set correctly for the calling program upon return to it.
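
    As a sketch of how little state that adds, the save area might look
    like the struct below; the layout and names are invented, not the
    actual My 66000 ENTER/EXIT frame format:

        #include <stdint.h>

        /* Invented layout: the preserved registers plus one 32-bit word
           carrying their tag bits (the "minimal cost" mentioned above). */
        struct save_area {
            uint64_t saved_regs[32];   /* the callee-saved values */
            uint32_t saved_tags;       /* bit i = tag of saved_regs[i] */
        };

        /* EXIT restores values and tags together, so the caller's tags
           are intact on return. */
        void exit_restore(const struct save_area *sa,
                          uint64_t reg[32], uint32_t *tags)
        {
            for (int i = 0; i < 32; i++)
                reg[i] = sa->saved_regs[i];
            *tags = sa->saved_tags;
        }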

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some
    other instructions to do this, without requiring another op-code.
    For example, Oring a register with itself could be used to set the
    tag bit and Oring a register with zero could clear it.  These should
    be pretty rare.
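
    In decoder terms, the repurposing could look like the sketch below,
    reusing the register/tag model from the earlier sketch; the two
    encodings follow the suggestion above, everything else is invented:

        #include <stdint.h>

        static uint64_t reg[32];    /* register file and tags, as before */
        static uint32_t tags;

        /* OR of a register with itself, or with zero, leaves the value
           unchanged, so those otherwise-useless forms are repurposed. */
        void or_insn(int rd, int rs1, int rs2, int src2_is_zero_imm)
        {
            if (src2_is_zero_imm && rs1 == rd) {
                tags &= ~(1u << rd);        /* OR rd,rd,#0: clear tag */
            } else if (rs1 == rd && rs2 == rd) {
                tags |= 1u << rd;           /* OR rd,rd,rd: set tag */
            } else {
                uint64_t b = src2_is_zero_imm ? 0 : reg[rs2];
                reg[rd] = reg[rs1] | b;     /* an ordinary OR */
                tags &= ~(1u << rd);        /* non-FP result clears tag */
            }
        }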

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it?

    No.

    To me, a major question is the effect on performance.  What is the
    cost of having to decode the source registers and read their
    respective tag bits before knowing which FU to use?

    The problem is that you have made decode dependent on dynamic pipeline
    information. I suggest you don't want to do that. Consider a change from
    int to FP as a predicated instruction: the pipeline cannot DECODE the
    instruction at hand until the predicate resolves. Yech.

    Good point.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 17 01:11:12 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:

    On 4/3/2024 11:44 AM, EricP wrote:


    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISAs
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded value to double precision.

    Insufficient verbal precision::

    My 66000 only cares about the size of a value being loaded from memory
    (or ST into memory).

    While (float) LDs load the 32-bit value from memory, they remain (float)
    while residing in the register; and the High Order 32-bits are ignored.
    The (float) register can be consumed by a (float) FP calculation and it
    remains (float) after processing.

    Small immediates, when consumed by FP instructions, are converted from
    integer to <sized> FP during DECODE. So::

    FADD R7,R7,#1

    adds 1.0D0 to the (double) value in R7 (and takes one 32-bit instruction), while:

    FADDs R7,R7,#1

    Adds 1.0E0 to the (float) value in R7.
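
    In C terms, that DECODE-time conversion is just an integer-to-FP
    widening of the immediate; the sketch below (names invented) prints
    the standard IEEE 754 bit patterns for 1.0 in each width:

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* FADD widens its small immediate to double, FADDs to float,
           before the operand reaches the FP unit. */
        static uint64_t imm_as_double(int32_t imm)
        {
            double d = (double)imm;        /* #1 becomes 1.0D0 */
            uint64_t bits; memcpy(&bits, &d, sizeof bits);
            return bits;
        }

        static uint32_t imm_as_float(int32_t imm)
        {
            float f = (float)imm;          /* #1 becomes 1.0E0 */
            uint32_t bits; memcpy(&bits, &f, sizeof bits);
            return bits;
        }

        int main(void)
        {
            printf("%016llx\n", (unsigned long long)imm_as_double(1));
            printf("%08x\n", (unsigned)imm_as_float(1));
            /* prints 3ff0000000000000 and 3f800000 */
            return 0;
        }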
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 17 15:06:05 2024
    From Newsgroup: comp.arch

    On Mon, 8 Apr 2024 17:25:38 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?

    Well, floating-point and integer instructions of one size each can be arbitrarily mixed. And when different sizes need to mix, going to
    36-bit instructions is low overhead.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 17 15:07:18 2024
    From Newsgroup: comp.arch

    On Mon, 8 Apr 2024 19:56:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256 bits, looks at the first 4 bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    Yes, I don't disagree. I'm just pointing out that it's possible to
    make the mini tags idea work that way, since it lets you easily turn
    mini tags off when you need to.
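
    A rough sketch of that redirection, assuming a block holds either
    seven 36-bit or nine 28-bit containers after the 4-bit header
    (4 + 7*36 = 4 + 9*28 = 256); the field conventions are guesses for
    illustration, not the actual encoding:

        #include <stdint.h>

        enum { BLOCK_BYTES = 32 };          /* 256-bit blocks */

        /* Round a branch target down to its block; keep the low bits
           as a container index rather than a byte offset. */
        void resolve_branch(uint64_t target, uint64_t *block,
                            unsigned *index)
        {
            *block = target & ~(uint64_t)(BLOCK_BYTES - 1);
            *index = (unsigned)(target & (BLOCK_BYTES - 1));
        }

        /* Once the block is fetched, its first 4 bits pick the format. */
        unsigned container_bits(unsigned header_nibble)
        {
            return (header_nibble & 1) ? 36 : 28;  /* invented convention */
        }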

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 19:19:53 2024
    From Newsgroup: comp.arch

    On 4/11/24 7:12 PM, MitchAlsup1 wrote:
    Scott Lurndal wrote:
    [snip]
    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches.   Likewise memset.

    Effectively, that is what HW does. Even on the lower-end machines,
    the AGEN unit of the cache access pipeline is repeatedly cycled,
    and data is read and/or written. One can execute instructions not
    needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
    are in progress.

    Moving this sequencer farther out would still require it to consume
    all L1 BW in any event (snooping) for memory consistency reasons.
    {Note: cache accesses are performed line-wide, not register-width
    wide}
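
    Scott's suggestion, in source terms, amounts to something like the
    sketch below, where dma_submit() is an invented stand-in for whatever
    offload interface the platform provides:

        #include <stddef.h>
        #include <string.h>

        /* Hypothetical offload interface -- not a real API. */
        void dma_submit(void *dst, const void *src, size_t n);

        /* Route copies above a threshold (here one cache line) to the
           offloaded engine; small copies stay on the core. */
        void smart_memmove(void *dst, const void *src, size_t n)
        {
            if (n >= 64)
                dma_submit(dst, src, n);  /* copy without trashing L1 */
            else
                memmove(dst, src, n);
        }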

    If the data was not in L1 cache, only its absence would need to be
    determined by the DMA engine. A snoop filter, tag-inclusive L2/L3
    probing, or similar mechanism could avoid L1 accesses. Even if the
    source or destination for a memory copy was in L1, only one L1
    access per cache line might be needed.

    I also wonder if the cache fill and/or spill mechanism might be
    decoupled from the load/store such that if the cache had enough
    banks/subarrays some loads and stores might be done in parallel
    with a cache fill or spill/external-read-without-eviction. Tag
    checking would limit the utility of such, though tags might also
    be banked or access flexibly scheduled (at the cost of choosing a
    victim early for fills). Of course, if the cache has such
    available bandwidth, why not make it available to the core as well
    even if it was rarely useful? (Perhaps higher register bandwidth
    might be more difficult than higher cache bandwidth for banking-
    friendly patterns?)

    Deciding when to bypass cache seems difficult (for both software
    developers and hardware). Overwriting cache lines within the same
    memory copy is obviously silly. Filling a cache with a memory copy
    is also suboptimal, but L1 hardware copy-on-write would probably
    be too complicated even with page aligned copies. A copy from
    cacheable memory to uncacheable memory (I/O) might be a strong
    hint that the source should not be installed into L1 or L2 cache,
    but I would guess that not installing the source would often be
    the right choice.

    I could also imagine a programmer wanting to use memory copy as a
    prefetch *directive* for a large chunk of memory (by having source
    and destination be the same). This idiom would be easy to detect
    (from and to base registers being the same), but may be too niche
    to be worth detecting (for most implementations).
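
    The idiom itself is trivial to write down (sketch below); note that
    a C compiler is entitled to elide a self-copy, so a real directive
    would need an intrinsic or inline assembly rather than plain
    memmove:

        #include <string.h>

        /* Hypothetical prefetch directive: a move whose source and
           destination bases are equal.  Hardware that detects the
           matching base registers could treat it as a block prefetch. */
        void prefetch_range(void *p, size_t n)
        {
            memmove(p, p, n);   /* values unchanged, range touched */
        }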

    (My 66000 might use an idiom with a prefetch instruction preceding
    a memory move to indicate the cache level of the destination but
    that only manages [some of] the difficulty of the hardware
    choice.)

    For memset, compression is also an obvious possibility. A memset
    might not write any cache lines but rather cache the address range
    and the set value and perform hardware copy on access into cache
    lines.
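
    A sketch of the bookkeeping such a lazy memset implies: remember
    only the range and fill value, and materialize a line when it is
    actually accessed. Purely illustrative; the partial head/tail lines,
    which would need read-for-ownership, are ignored here:

        #include <stdint.h>
        #include <string.h>

        enum { LINE = 64 };

        /* The cached memset: an address range plus its fill byte. */
        struct pending_set {
            uint64_t base, len;
            uint8_t  value;
            int      valid;
        };

        /* On a demand access, synthesize the line instead of reading
           memory (hardware copy on access). */
        int fill_on_access(const struct pending_set *ps,
                           uint64_t line_addr, uint8_t line[LINE])
        {
            if (!ps->valid || line_addr < ps->base ||
                line_addr + LINE > ps->base + ps->len)
                return 0;               /* not fully covered */
            memset(line, ps->value, LINE);
            return 1;
        }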
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 20:02:07 2024
    From Newsgroup: comp.arch

    On 4/11/24 10:30 AM, Scott Lurndal wrote:
    [snip]
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    [snip]
    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...

    While Mitch Alsup's response ("The entire chunk of data traverses
    the interconnect as a single transaction." — I am not certain how
    that would work given reading up to a page and writing up to a
    page) provides one mechanism and probably the best one,
    theoretically the *data* does not need to be moved atomically but
    only the "ownership" (the source does not have to be owned in the
    traditional sense but needs to be marked as readable by the copier).
    This is somewhat similar to My 66000's Exotic Synchronization
    Mechanism in that once all the addresses involved are known (the
    two ranges for the memory copy), NAKs can be used to answer remote
    requests for "owned" cache lines while the copy is made.

    Only the visibility needs to be atomic.

    Memory set provides optimization opportunities in that the source
    is small. In theory, the set value could be sent to L3 with the
    destination range and all monitoring could be done at L3 and
    requested cache line sent immediately from L3 (hardware copy on
    access) — the first and last part of the range might be partial
    cache lines requiring read-for-ownership.

    For cache line aligned copies, a cache which used indirection
    between tags and data might not even copy the data but only the
    tag-related metadata. Some forms of cache compression might allow
    partial cache lines to be cached such that even unaligned copies
    might partially share data by having one tag indicate lossy
    compression with an indication of where the stored data is not
    valid, but that seems too funky to be practical.
    --- Synchronet 3.20a-Linux NewsLink 1.114