• Re: Combining Practicality with Perfection

    From BGB@cr88192@gmail.com to comp.arch on Thu Feb 19 02:10:07 2026
    From Newsgroup: comp.arch

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.


    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.



    FMAC does suddenly get a bit cheaper if its scope is limited to
    FP8*FP8+FP16, but this operation is a bit niche.


This one makes a lot of sense for NNs, but I haven't gotten my NN tech
working well enough to make a strong use-case for it.

Where, in terms of algorithmic or behavioral complexity relative to
computational efficiency, NNs are significantly behind what is possible
with genetic algorithms or genetic programming.


    So, for computational efficiency of the result:
    Hand-written native code, best efficiency;
    Genetic algorithm, moderate efficiency;
    Neural Net, very inefficient.

The merit of NNs, then, would be if one could make them adaptive in
some practical way:
    Native code: No adaptation apart from specific algos;
    Genetic algorithms: Only when running the evolver, static otherwise;
    NN's: Could be made adaptable in theory, usually fixed in practice.

    And, adaptation process:
    Native: None, maybe manual fiddling by programmer;
    Genetic algo: Initially very slow, gradually converges on answer;
NNs, via genetic algorithm: Slow, but converges toward an answer;
    NNs, via backprop: Rapid adaptation initially, then hits a plateau.

    Backprop is seemingly prone to get stuck at a non-optimal solution, and
    then is hard pressed to make any further progress. Seemingly isn't
    really able to "fix" any obvious structural defects once it hits a
    plateau, but can sometimes jump up or down between various nearby
    options (when obvious suboptimal patterns persist).

    Some tricks that work with GA-NN's don't really work with backprop, and
    my initial attempts to glue GA handling onto backprop have not been
    effective. Also it seems to need at least FP16 weights for training to
    work effectively (though, one other option being FP8 with a bias
    counter; but this is effectively analogous to using a non-standard
    S.E4.M11 format).
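
As a purely illustrative guess (the post only names the field widths, so
the bias and subnormal handling below are assumptions, borrowing the
bias of 7 from FP8 E4M3), such an S.E4.M11 value might decode like:

```python
def decode_s_e4_m11(bits):
    """Decode a hypothetical 1-sign/4-exponent/11-mantissa value.

    Only the field widths come from the S.E4.M11 name; the exponent
    bias of 7 and the subnormal convention are assumed for the sketch.
    """
    s = (bits >> 15) & 1
    e = (bits >> 11) & 0xF
    m = bits & 0x7FF
    if e == 0:                                   # assumed subnormal range
        val = (m / 2048.0) * 2.0 ** (1 - 7)
    else:                                        # normal: implicit leading 1
        val = (1.0 + m / 2048.0) * 2.0 ** (e - 7)
    return -val if s else val
```

E.g. 0x3800 (e=7, m=0) decodes to 1.0 under these assumptions.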


    Seemingly, my own efforts are getting stuck at the level of very
    inefficiently solving very mundane issues, nowhere near the success
    being seen by more mainstream efforts.

    Nor, as of yet, even anything particularly interesting...



    Had started making some progress in other types of areas though, for
    example:
    Figured out a practical way to get below 16kbps for audio...

By using 8 kHz ADPCM and then using lookup-table and reversed-LZ-search
trickery to make the audio more LZ-compressible (without changing the
storage format).

Or, basically, an ADPCM encoding strategy like:
  Look up a match for the last 4 bytes;
  Look for the longest backwards match (last N bytes);
    Evaluate if the next byte for the pattern is within an error limit;
    Select based on a combination of error and length
      (longer matches permit more error than shorter ones).
  Check a pattern table,
    seeing if anything is within an acceptable error limit;
    Use the pattern if so.
  Else:
    Figure out the best match for the next 6 samples,
    using this to encode the next 4 samples (1 byte).
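
The error-vs-length trade in the match selection could be sketched
roughly like this (all names and thresholds here are illustrative
assumptions, not the actual encoder):

```python
def pick_match(candidates, base_limit=4.0, bonus_per_byte=0.5):
    """Pick a backward match, letting longer matches tolerate more error.

    candidates: list of (length_bytes, error) pairs for backward matches.
    Returns (score, length, error) for the best acceptable match, or
    None to signal falling back to the pattern table / fresh encode.
    """
    best = None
    for length, error in candidates:
        limit = base_limit + bonus_per_byte * length  # longer => laxer limit
        if error <= limit:
            score = length - error / limit            # favor long, low-error
            if best is None or score > best[0]:
                best = (score, length, error)
    return best
```

E.g. given [(2, 10.0), (8, 6.0), (3, 1.0)] the length-8 match wins even
though it has more absolute error, which is what biases the output
stream toward long LZ-compressible repeats.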

Was able to get around a 20-30% reduction in bitrate, or around 12 kbps
typical, before loss of audio quality becomes unacceptable (starts
breaking down in obvious ways).


Did a version for 4-bit ADPCM, which can get a roughly similar
reduction, or around 24 kbps, though trying to push it much lower makes
2-bit ADPCM preferable.

A slightly higher reduction rate is possible if the baseline
sample-rate is increased to 16 kHz, but this still doesn't get as low
as when using 8 kHz.



Note that it is possible to just use a pattern table directly to give
an equivalent of 8 kbps ADPCM (each byte encoding an index into an
8-sample table, which is then decoded as 2-bit ADPCM), but the audio
quality is unacceptably poor (for much of any use-case).
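
As a back-of-envelope check, the bitrates quoted above all follow from
the 8 kHz sample rate (a sketch, using a 25% LZ-side reduction as the
middle of the reported 20-30% range):

```python
rate_hz = 8000

adpcm2_kbps = rate_hz * 2 / 1000        # 2-bit ADPCM baseline: 16 kbps
adpcm4_kbps = rate_hz * 4 / 1000        # 4-bit ADPCM baseline: 32 kbps
pattern_kbps = rate_hz * 8 / 8 / 1000   # 1 byte per 8 samples:   8 kbps

assert adpcm2_kbps == 16.0 and adpcm4_kbps == 32.0 and pattern_kbps == 8.0

# A ~25% LZ-side reduction lands on the figures reported above:
assert adpcm2_kbps * 0.75 == 12.0       # "around 12 kbps typical"
assert adpcm4_kbps * 0.75 == 24.0       # "around 24 kbps"
```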


    Though, all this was mostly dusting off an experiment from last year,
    and putting it to use in my packaging tool (inside BGBCC).

    Mostly it is a case of:
    It is "good enough" to at least allow for optional super-compression of
    ADPCM without breaking the existing decoders.


    ...





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 17:30:50 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    The add stage after the multiplication tree is <essentially> 2× as wide.
    FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Feb 19 20:15:29 2026
    From Newsgroup: comp.arch

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

The add stage after the multiplication tree is <essentially> 2× as wide.
FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).

    Arm Inc. application processors cores have FMAC latency=4 for
    multiplicands, but 2 for accumulator.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Feb 19 18:49:22 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> schrieb:

Quadi, have your computer architectures included IBM 360 floating
point support? There is probably more demand for that than for 36-bit
these days.

    It has been quite a few decades since the last large-scale
    scientific calculations in IBM hex float; I believe it must have
    been the Japanese vector computers (one of which I worked on in
    the mid to late 1990s). It is probably safe to say that any
    hex float these days is embedded firmly in the z ecosystem.
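
For the curious, IBM hex float is simple enough to convert in a few
lines: base-16 exponent, excess-64, no hidden bit. A sketch for the
32-bit ("short") format, ignoring the unnormalized-zero corner cases:

```python
def ibm32_to_float(word):
    """Convert a 32-bit IBM System/360 hex float word to a Python float.

    Layout: 1 sign bit, 7-bit excess-64 exponent (power of 16),
    24-bit fraction with no hidden bit: value = 0.f * 16**(e-64).
    """
    s = (word >> 31) & 1
    e = (word >> 24) & 0x7F
    f = word & 0xFFFFFF
    val = (f / float(1 << 24)) * 16.0 ** (e - 64)
    return -val if s else val
```

E.g. 0x41100000 (fraction 1/16, exponent 65) is 1.0.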

    Since every laptop these days has more performance than the old
    vector computers, I very much doubt that there is significant data
    saved in that format. Same thing for VAX floating point formats.

Big- vs. little-endian data is a more recent issue. Around 20 years
ago, I wrote code to convert between big- and little-endian data
for gfortran. This is also quite irrelevant today.

The last conversion issue I had a hand in was for IBM's "double
double" 128-bit real. POWER now supports 128-bit IEEE in hardware
(if not very fast), but the ABI change is very painful.

    There could, however, be a niche for 36-bit reals - graphics cards.
    I have recently discovered a GPU solver in a commercial package that
    I use, and it has an option for using 32-bit reals. 36-bit reals
    could extend the usefulness of such a solver.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 19:55:40 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

    The add stage after the multiplication tree is <essentially> 2× as
    wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).


    Arm Inc. application processors cores have FMAC latency=4 for
    multiplicands, but 2 for accumulator.

    Thank you for that tid-bit of information.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Feb 20 08:14:46 2026
    From Newsgroup: comp.arch

    On Wed, 18 Feb 2026 08:40:38 -0500, Robert Finch wrote:

    Maybe we should switch to 18-bit bytes to support UNICODE.

It's true that Unicode has expanded beyond the old 16-bit Basic
Multilingual Plane. But while all currently-defined characters would
fit in 18 bits, they envisage enlarging Unicode to as many as 31 bits;
that is what the original UTF-8 design supports.

If 9-bit bytes are used for simple applications, it certainly will be
true that 18-bit halfwords will be an available data type.

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 05:08:28 2026
    From Newsgroup: comp.arch

    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
denormals can be implemented cheaply and efficiently in hardware. The
additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

So, for Binary32 compute it as Binary64, and for Binary64 compute it
as Binary128.

    Can special case the "Binary64 * Binary64 => Binary128" case to save
    cost over using a native Binary128 multiply.

    For Binary128 multiply, can also make sense to detect and special-case
    the "low order bits are zero" case:
    If low-order bits are zero, can use a multiply that only produces the
    high 128 bits;
    Vs a transient 128*128=>256 bit, and then needing to round.



    Relative cost is lower if one is already paying the cost of a trap
    handler or similar (except that if the ISA supports it, you really don't
    want the compiler to combine these operations).

So, one can maybe document, if using a compiler like GCC, to use
"-ffp-contract=off -fno-fdiv", ...



    Well, and the secondary irony that it is mainly cost-added for FMUL,
whereas FADD almost invariably has the necessary support hardware already.
    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

The add stage after the multiplication tree is <essentially> 2× as wide.
FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    A 160-bit 3-way adder happening "quickly" is still kinda asking a lot though...


Though, granted, the first step is deciding to do a full-width
multiply, and not discard the low-order results.

Granted, discarding the low results reduces rounding accuracy, but a
way to fake full IEEE rounding was to detect this case and have the
FMUL raise a fault (similar to denormal/underflow handling). Though, it
does mean there is a performance penalty if multiplying numbers where
the low-order bits in both values are non-zero.


In my ISA, the exact behavior depends on the instruction and rounding
mode. In the RISC-V mode, it is partly based on the instruction's
rounding mode and flags settings.

For reasons, though, full IEEE emulation can't safely be enabled until
after setting up virtual memory and similar.


The handling of the RISC-V F/D extensions was non-standard in my case,
though not in a way that affects GCC output (it seems to exclusively
use the DYN rounding mode in instructions, assuming the rounding mode
to be handled via CSRs). Also, ironically, and contrasting with the
seeming design of these extensions, these registers are so rarely
accessed in practice that it seemed most sensible to use
trap-and-emulate for the CSRs.
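
For reference, the rm/frm resolution that GCC's DYN-only output relies
on can be sketched straight from the RISC-V unprivileged spec (the
function name is illustrative):

```python
# RISC-V F-extension rounding modes: FP instructions carry a 3-bit rm
# field, where 0b111 (DYN) means "use the frm field of the fcsr CSR
# instead"; frm occupies fcsr bits 7:5 (fflags are bits 4:0).
RM_NAMES = {0b000: "RNE", 0b001: "RTZ", 0b010: "RDN",
            0b011: "RUP", 0b100: "RMM"}

def effective_rm(instr_rm, fcsr):
    """Resolve the rounding mode actually used by an FP instruction."""
    rm = (fcsr >> 5) & 0b111 if instr_rm == 0b111 else instr_rm
    if rm not in RM_NAMES:
        raise ValueError("reserved/invalid rounding mode")
    return RM_NAMES[rm]
```

So GCC's code only ever touches the mode via fcsr writes, which is
what makes trap-and-emulate for the CSRs cheap in practice.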


    Granted, there are limits to corner cutting:
    If a design does not produce exact results for cases where it is trivial
    to verify that an exact answer exists in cases that do not require
    rounding, IMO this is below the minimum limit for a usable general
    purpose FPU.



    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).


    Trying to push the latency down would be pretty bad for timing, unless
    there is some cheaper way to implement FPUs that I am not aware of.

    In my case:
    FMADD.D, RM=DYN: Trap
    FMADD.D, RM=RNE, 10-cycle, double-rounded (non-standard)
    FMADD.S, RM=DYN, 10-cycle (mimics single rounding, *)
    *: Happens internally at Binary64 precision.

    It could be possible to handle FMADD.D RM=DYN the same way as RNE
    internally, but then trap if the inputs would potentially give a
    non-IEEE result. Though, for now, trapping is the cheaper solution in
    terms of HW cost.




The one exception is FP8*FP8 + FP16, but mostly because it is possible
to do FP8*FP8 in under 1 cycle.

    But, still not free here; and overly niche. Ended up going with a
    cheaper option of simply having an SIMD FP8*FP8=>FP16 multiply op (which
    still ends up as a 2-cycle op, because...).


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Feb 20 15:22:05 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
denormals can be implemented cheaply and efficiently in hardware. The
additional hardware cost (or the cost of trapping and software
emulation) has been the only argument against denormals that I ever
encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for computation.

So, for Binary32 compute it as Binary64, and for Binary64 compute it
as Binary128.

Neither of those work!
I believed this to be true, but I was shown the error of my thinking by
more knowledgeable people in the 754 working group. I.e., they had a
very simple/small example where doing the calculation in the next
higher precision would still cause double rounding errors.

Also note that Mitch has stated multiple times that you need ~160
mantissa bits during FMAC double calculations.
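
A concrete binary32 instance of this failure mode (an illustrative
construction, not the working group's actual example): pick a and b so
that the exact a*b + c lands just above a binary32 tie point; the
intermediate binary64 rounding then snaps the result onto the tie, and
the final narrowing rounds the wrong way.

```python
import struct
from fractions import Fraction

def f32(x):
    """Round a Python float (binary64) to binary32 and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a = f32(1025 * 2.0**-27)        # exactly representable in binary32
b = f32(1047553 * 2.0**-27)     # exactly representable in binary32
c = 1.0

# 1025 * 1047553 == 2**30 + 1, so exactly:
#   a*b + c == 1 + 2**-24 + 2**-54,
# which lies just ABOVE the binary32 tie point 1 + 2**-24; a correctly
# rounded FMA must therefore return 1 + 2**-23.
exact = Fraction(a) * Fraction(b) + 1
assert exact == 1 + Fraction(1, 2**24) + Fraction(1, 2**54)

# "Widen to binary64, round once at the end": the binary64 add rounds
# 1 + 2**-24 + 2**-54 down to the tie 1 + 2**-24, and ties-to-even then
# drops all the way to 1.0 -- the sticky 2**-54 has been lost.
double_rounded = f32(a * b + c)
correct = 1.0 + 2.0**-23

assert double_rounded == 1.0
assert double_rounded != correct
```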
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 15:26:24 2026
    From Newsgroup: comp.arch

    On 2/20/2026 8:22 AM, Terje Mathisen wrote:
    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
denormals can be implemented cheaply and efficiently in hardware. The
additional hardware cost (or the cost of trapping and software
emulation) has been the only argument against denormals that I ever
encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it
    as Binary128.

    Neither of those work!

    I believed this to be true but I was shown the error of my thinking by
more knowledgeable people in the 754 working group. I.e., they had a
very simple/small example where doing the calculation in the next
higher precision would still cause double rounding errors.

Also note that Mitch has stated multiple times that you need ~160
mantissa bits during FMAC double calculations.


Could look into this, the next option being to use a makeshift 192-bit
FP format with a 176-bit mantissa (likely cheaper than going all the
way to 224 bits).

    This is slow/annoying, but not really likely a "hard" problem (when one
    is already doing this stuff in software in a trap handler).


    So, potentially:
    Binary32 -> FP96 (truncated Binary128, still stored as Binary128)
    Binary64 -> FP192 (extended Binary128)
    Binary128 -> FP384 (likewise)
    Big/ugly, but no one says this needs to be fast...


    Might end up on a sort of "TODO list"...


    In any case, actual native hardware support for single-rounded FMA is
    unlikely to happen in my case.


    ...

    Terje


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sat Feb 21 20:27:25 2026
    From Newsgroup: comp.arch

    On 2/19/26 12:30 PM, MitchAlsup wrote:
    [snip]
The add stage after the multiplication tree is <essentially> 2× as wide.
FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.

    How much of that applies to a double-rounded FMADD? Double-
    rounding would have the advantage of giving bit-identical
    results.

While single rounding presents more algorithmic opportunities,
    double-rounded FMADD would still save decode bandwidth (and
    issue bandwidth if a pair of FMUL and FADD instructions were not
    fused by idiom recognition) at the cost of supporting three-
    input operations (with reduced forwarding per unit work).

Getting bit-identical results for FMUL and FADD whether executed
separately, fused by idiom recognition, or via an ISA-extension FMADD
might have some practical benefit.

    (I still wonder if an FP execution model that only calculated to
    the integer size precision (64-bit integer for 64-bit FP),
    ignoring carry-in, might have been acceptable for 99% of uses
    and saved a little bit of power and area as well as potentially
    facilitated software FP — to implement FMUL without carry-in one
    would have to have an integer multiply high result that did not
    use carry-in (a mirror of multiply low execution) which would
    probably only be useful for multiplication by reciprocal as
    integer results are expected to be exact.)
    --- Synchronet 3.21b-Linux NewsLink 1.2