• Re: Combining Practicality with Perfection

    From BGB@cr88192@gmail.com to comp.arch on Thu Feb 19 02:10:07 2026
    From Newsgroup: comp.arch

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.


    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.



    FMAC does suddenly get a bit cheaper if its scope is limited to
    FP8*FP8+FP16, but this operation is a bit niche.


This one makes a lot of sense for NNs, but I haven't gotten my NN tech
working well enough to make a strong use-case for it.

Where, in terms of algorithmic or behavioral complexity relative to
computational efficiency, NNs are significantly behind what is possible
with genetic algorithms or genetic programming.


    So, for computational efficiency of the result:
    Hand-written native code, best efficiency;
    Genetic algorithm, moderate efficiency;
    Neural Net, very inefficient.

The merit of NNs, then, would be if one could make them adaptive in
some practical way:
    Native code: No adaptation apart from specific algos;
    Genetic algorithms: Only when running the evolver, static otherwise;
    NN's: Could be made adaptable in theory, usually fixed in practice.

    And, adaptation process:
    Native: None, maybe manual fiddling by programmer;
    Genetic algo: Initially very slow, gradually converges on answer;
NNs, via genetic algorithm: Slow, but converges toward an answer;
    NNs, via backprop: Rapid adaptation initially, then hits a plateau.

    Backprop is seemingly prone to get stuck at a non-optimal solution, and
    then is hard pressed to make any further progress. Seemingly isn't
    really able to "fix" any obvious structural defects once it hits a
    plateau, but can sometimes jump up or down between various nearby
    options (when obvious suboptimal patterns persist).

    Some tricks that work with GA-NN's don't really work with backprop, and
    my initial attempts to glue GA handling onto backprop have not been
    effective. Also it seems to need at least FP16 weights for training to
    work effectively (though, one other option being FP8 with a bias
    counter; but this is effectively analogous to using a non-standard
    S.E4.M11 format).
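
As a purely illustrative guess (the post only names the field widths, so
the bias and subnormal handling below are assumptions, borrowing the
bias of 7 from FP8 E4M3), such an S.E4.M11 value might decode like:

```python
def decode_s_e4_m11(bits):
    """Decode a hypothetical 1-sign/4-exponent/11-mantissa value.

    Only the field widths come from the S.E4.M11 name; the exponent
    bias of 7 and the subnormal convention are assumed for the sketch.
    """
    s = (bits >> 15) & 1
    e = (bits >> 11) & 0xF
    m = bits & 0x7FF
    if e == 0:                                   # assumed subnormal range
        val = (m / 2048.0) * 2.0 ** (1 - 7)
    else:                                        # normal: implicit leading 1
        val = (1.0 + m / 2048.0) * 2.0 ** (e - 7)
    return -val if s else val
```

E.g. 0x3800 (e=7, m=0) decodes to 1.0 under these assumptions.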


    Seemingly, my own efforts are getting stuck at the level of very
    inefficiently solving very mundane issues, nowhere near the success
    being seen by more mainstream efforts.

    Nor, as of yet, even anything particularly interesting...



    Had started making some progress in other types of areas though, for
    example:
    Figured out a practical way to get below 16kbps for audio...

By using 8 kHz ADPCM and then using lookup-table and reversed-LZ-search
trickery to make the audio more LZ-compressible (without changing the
storage format).

Or, basically, an ADPCM encoding strategy like:
  Look up a match for the last 4 bytes;
  Look for the longest backwards match (last N bytes);
    Evaluate if the next byte for the pattern is within an error limit;
    Select based on a combination of error and length
      (longer matches permit more error than shorter ones).
  Check a pattern table,
    seeing if anything is within an acceptable error limit;
    Use the pattern if so.
  Else:
    Figure out the best match for the next 6 samples,
    using this to encode the next 4 samples (1 byte).
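
The error-vs-length trade in the match selection could be sketched
roughly like this (all names and thresholds here are illustrative
assumptions, not the actual encoder):

```python
def pick_match(candidates, base_limit=4.0, bonus_per_byte=0.5):
    """Pick a backward match, letting longer matches tolerate more error.

    candidates: list of (length_bytes, error) pairs for backward matches.
    Returns (score, length, error) for the best acceptable match, or
    None to signal falling back to the pattern table / fresh encode.
    """
    best = None
    for length, error in candidates:
        limit = base_limit + bonus_per_byte * length  # longer => laxer limit
        if error <= limit:
            score = length - error / limit            # favor long, low-error
            if best is None or score > best[0]:
                best = (score, length, error)
    return best
```

E.g. given [(2, 10.0), (8, 6.0), (3, 1.0)] the length-8 match wins even
though it has more absolute error, which is what biases the output
stream toward long LZ-compressible repeats.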

Was able to get around a 20-30% reduction in bitrate, or around 12 kbps
typical, before loss of audio quality becomes unacceptable (starts
breaking down in obvious ways).


Did a version for 4-bit ADPCM, which can get a roughly similar
reduction, or around 24 kbps, though trying to push it much lower makes
2-bit ADPCM preferable.

A slightly higher reduction rate is possible if the baseline
sample-rate is increased to 16 kHz, but this still doesn't get as low
as when using 8 kHz.



Note that it is possible to just use a pattern table directly to give
an equivalent of 8 kbps ADPCM (each byte encoding an index into an
8-sample table, which is then decoded as 2-bit ADPCM), but the audio
quality is unacceptably poor (for much of any use-case).
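
As a back-of-envelope check, the bitrates quoted above all follow from
the 8 kHz sample rate (a sketch, using a 25% LZ-side reduction as the
middle of the reported 20-30% range):

```python
rate_hz = 8000

adpcm2_kbps = rate_hz * 2 / 1000        # 2-bit ADPCM baseline: 16 kbps
adpcm4_kbps = rate_hz * 4 / 1000        # 4-bit ADPCM baseline: 32 kbps
pattern_kbps = rate_hz * 8 / 8 / 1000   # 1 byte per 8 samples:   8 kbps

assert adpcm2_kbps == 16.0 and adpcm4_kbps == 32.0 and pattern_kbps == 8.0

# A ~25% LZ-side reduction lands on the figures reported above:
assert adpcm2_kbps * 0.75 == 12.0       # "around 12 kbps typical"
assert adpcm4_kbps * 0.75 == 24.0       # "around 24 kbps"
```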


    Though, all this was mostly dusting off an experiment from last year,
    and putting it to use in my packaging tool (inside BGBCC).

    Mostly it is a case of:
    It is "good enough" to at least allow for optional super-compression of
    ADPCM without breaking the existing decoders.


    ...





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 17:30:50 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    The add stage after the multiplication tree is <essentially> 2× as wide.
    FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Feb 19 20:15:29 2026
    From Newsgroup: comp.arch

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

The add stage after the multiplication tree is <essentially> 2× as wide.
FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).

    Arm Inc. application processors cores have FMAC latency=4 for
    multiplicands, but 2 for accumulator.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Feb 19 18:49:22 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> schrieb:

Quadi, have your computer architectures included IBM 360 floating
point support? There is probably more demand for that than for 36-bit
these days.

    It has been quite a few decades since the last large-scale
    scientific calculations in IBM hex float; I believe it must have
    been the Japanese vector computers (one of which I worked on in
    the mid to late 1990s). It is probably safe to say that any
    hex float these days is embedded firmly in the z ecosystem.
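
For the curious, IBM hex float is simple enough to convert in a few
lines: base-16 exponent, excess-64, no hidden bit. A sketch for the
32-bit ("short") format, ignoring the unnormalized-zero corner cases:

```python
def ibm32_to_float(word):
    """Convert a 32-bit IBM System/360 hex float word to a Python float.

    Layout: 1 sign bit, 7-bit excess-64 exponent (power of 16),
    24-bit fraction with no hidden bit: value = 0.f * 16**(e-64).
    """
    s = (word >> 31) & 1
    e = (word >> 24) & 0x7F
    f = word & 0xFFFFFF
    val = (f / float(1 << 24)) * 16.0 ** (e - 64)
    return -val if s else val
```

E.g. 0x41100000 (fraction 1/16, exponent 65) is 1.0.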

    Since every laptop these days has more performance than the old
    vector computers, I very much doubt that there is significant data
    saved in that format. Same thing for VAX floating point formats.

Big- vs. little-endian data is a more recent issue. Around 20 years
ago, I wrote code to convert between big- and little-endian data
for gfortran. This is also quite irrelevant today.

The last conversion issue I had a hand in was for IBM's "double
double" 128-bit real. POWER now supports 128-bit IEEE in hardware
(if not very fast), but the ABI change is very painful.

    There could, however, be a niche for 36-bit reals - graphics cards.
    I have recently discovered a GPU solver in a commercial package that
    I use, and it has an option for using 32-bit reals. 36-bit reals
    could extend the usefulness of such a solver.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 19:55:40 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

    The add stage after the multiplication tree is <essentially> 2× as
    wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).


    Arm Inc. application processors cores have FMAC latency=4 for
    multiplicands, but 2 for accumulator.

    Thank you for that tid-bit of information.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Feb 20 08:14:46 2026
    From Newsgroup: comp.arch

    On Wed, 18 Feb 2026 08:40:38 -0500, Robert Finch wrote:

    Maybe we should switch to 18-bit bytes to support UNICODE.

It's true that Unicode has expanded beyond the old 16-bit Basic
Multilingual Plane. But while all currently-defined characters would
fit in 18 bits, they envisage enlarging Unicode to as many as 31 bits;
that is what the original UTF-8 design supports.

If 9-bit bytes are used for simple applications, it certainly will be
true that 18-bit halfwords will be an available data type.

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 05:08:28 2026
    From Newsgroup: comp.arch

    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
denormals can be implemented cheaply and efficiently in hardware. The
additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

So, for Binary32 compute it as Binary64, and for Binary64 compute it
as Binary128.

    Can special case the "Binary64 * Binary64 => Binary128" case to save
    cost over using a native Binary128 multiply.

    For Binary128 multiply, can also make sense to detect and special-case
    the "low order bits are zero" case:
    If low-order bits are zero, can use a multiply that only produces the
    high 128 bits;
    Vs a transient 128*128=>256 bit, and then needing to round.



    Relative cost is lower if one is already paying the cost of a trap
    handler or similar (except that if the ISA supports it, you really don't
    want the compiler to combine these operations).

So, one can maybe document, if using a compiler like GCC, to use
"-ffp-contract=off -fno-fdiv", ...



    Well, and the secondary irony that it is mainly cost-added for FMUL,
whereas FADD almost invariably has the necessary support hardware already.
    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

The add stage after the multiplication tree is <essentially> 2× as wide.
FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    A 160-bit 3-way adder happening "quickly" is still kinda asking a lot though...


Though, granted, the first step is deciding to do a full-width
multiply, and not discard the low-order results.

Granted, discarding the low results reduces rounding accuracy, but a
way to fake full IEEE rounding was to detect this case and have the
FMUL raise a fault (similar to denormal/underflow handling). Though, it
does mean there is a performance penalty if multiplying numbers where
the low-order bits in both values are non-zero.


In my ISA, the exact behavior depends on the instruction and rounding
mode. In the RISC-V mode, it is partly based on the instruction's
rounding mode and flags settings.

For reasons, though, full IEEE emulation can't safely be enabled until
after setting up virtual memory and similar.


The handling of the RISC-V F/D extensions was non-standard in my case,
though not in a way that affects GCC output (it seems to exclusively
use the DYN rounding mode in instructions, assuming the rounding mode
to be handled via CSRs). Also, ironically, and contrasting with the
seeming design of these extensions, these registers are so rarely
accessed in practice that it seemed most sensible to use
trap-and-emulate for the CSRs.
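
For reference, the rm/frm resolution that GCC's DYN-only output relies
on can be sketched straight from the RISC-V unprivileged spec (the
function name is illustrative):

```python
# RISC-V F-extension rounding modes: FP instructions carry a 3-bit rm
# field, where 0b111 (DYN) means "use the frm field of the fcsr CSR
# instead"; frm occupies fcsr bits 7:5 (fflags are bits 4:0).
RM_NAMES = {0b000: "RNE", 0b001: "RTZ", 0b010: "RDN",
            0b011: "RUP", 0b100: "RMM"}

def effective_rm(instr_rm, fcsr):
    """Resolve the rounding mode actually used by an FP instruction."""
    rm = (fcsr >> 5) & 0b111 if instr_rm == 0b111 else instr_rm
    if rm not in RM_NAMES:
        raise ValueError("reserved/invalid rounding mode")
    return RM_NAMES[rm]
```

So GCC's code only ever touches the mode via fcsr writes, which is
what makes trap-and-emulate for the CSRs cheap in practice.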


    Granted, there are limits to corner cutting:
    If a design does not produce exact results for cases where it is trivial
    to verify that an exact answer exists in cases that do not require
    rounding, IMO this is below the minimum limit for a usable general
    purpose FPU.



    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).


    Trying to push the latency down would be pretty bad for timing, unless
    there is some cheaper way to implement FPUs that I am not aware of.

    In my case:
    FMADD.D, RM=DYN: Trap
    FMADD.D, RM=RNE, 10-cycle, double-rounded (non-standard)
    FMADD.S, RM=DYN, 10-cycle (mimics single rounding, *)
    *: Happens internally at Binary64 precision.

    It could be possible to handle FMADD.D RM=DYN the same way as RNE
    internally, but then trap if the inputs would potentially give a
    non-IEEE result. Though, for now, trapping is the cheaper solution in
    terms of HW cost.




The one exception is FP8*FP8 + FP16, but mostly because it is possible
to do FP8*FP8 in under 1 cycle.

    But, still not free here; and overly niche. Ended up going with a
    cheaper option of simply having an SIMD FP8*FP8=>FP16 multiply op (which
    still ends up as a 2-cycle op, because...).


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Feb 20 15:22:05 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
denormals can be implemented cheaply and efficiently in hardware. The
additional hardware cost (or the cost of trapping and software
emulation) has been the only argument against denormals that I ever
encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for computation.

So, for Binary32 compute it as Binary64, and for Binary64 compute it
as Binary128.

Neither of those work!
I believed this to be true, but I was shown the error of my thinking by
more knowledgeable people in the 754 working group. I.e., they had a
very simple/small example where doing the calculation in the next
higher precision would still cause double rounding errors.

Also note that Mitch has stated multiple times that you need ~160
mantissa bits during FMAC double calculations.
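
A concrete binary32 instance of this failure mode (an illustrative
construction, not the working group's actual example): pick a and b so
that the exact a*b + c lands just above a binary32 tie point; the
intermediate binary64 rounding then snaps the result onto the tie, and
the final narrowing rounds the wrong way.

```python
import struct
from fractions import Fraction

def f32(x):
    """Round a Python float (binary64) to binary32 and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a = f32(1025 * 2.0**-27)        # exactly representable in binary32
b = f32(1047553 * 2.0**-27)     # exactly representable in binary32
c = 1.0

# 1025 * 1047553 == 2**30 + 1, so exactly:
#   a*b + c == 1 + 2**-24 + 2**-54,
# which lies just ABOVE the binary32 tie point 1 + 2**-24; a correctly
# rounded FMA must therefore return 1 + 2**-23.
exact = Fraction(a) * Fraction(b) + 1
assert exact == 1 + Fraction(1, 2**24) + Fraction(1, 2**54)

# "Widen to binary64, round once at the end": the binary64 add rounds
# 1 + 2**-24 + 2**-54 down to the tie 1 + 2**-24, and ties-to-even then
# drops all the way to 1.0 -- the sticky 2**-54 has been lost.
double_rounded = f32(a * b + c)
correct = 1.0 + 2.0**-23

assert double_rounded == 1.0
assert double_rounded != correct
```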
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 15:26:24 2026
    From Newsgroup: comp.arch

    On 2/20/2026 8:22 AM, Terje Mathisen wrote:
    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
denormals can be implemented cheaply and efficiently in hardware. The
additional hardware cost (or the cost of trapping and software
emulation) has been the only argument against denormals that I ever
encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it
    as Binary128.

    Neither of those work!

    I believed this to be true but I was shown the error of my thinking by
more knowledgeable people in the 754 working group. I.e., they had a
very simple/small example where doing the calculation in the next
higher precision would still cause double rounding errors.

Also note that Mitch has stated multiple times that you need ~160
mantissa bits during FMAC double calculations.


Could look into this, the next option being to use a makeshift 192-bit
FP format with a 176-bit mantissa (likely cheaper than going all the
way to 224 bits).

    This is slow/annoying, but not really likely a "hard" problem (when one
    is already doing this stuff in software in a trap handler).


    So, potentially:
    Binary32 -> FP96 (truncated Binary128, still stored as Binary128)
    Binary64 -> FP192 (extended Binary128)
    Binary128 -> FP384 (likewise)
    Big/ugly, but no one says this needs to be fast...


    Might end up on a sort of "TODO list"...


    In any case, actual native hardware support for single-rounded FMA is
    unlikely to happen in my case.


    ...

    Terje


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sat Feb 21 20:27:25 2026
    From Newsgroup: comp.arch

    On 2/19/26 12:30 PM, MitchAlsup wrote:
    [snip]
The add stage after the multiplication tree is <essentially> 2× as wide.
FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.

    How much of that applies to a double-rounded FMADD? Double-
    rounding would have the advantage of giving bit-identical
    results.

While single rounding presents more algorithmic opportunities,
    double-rounded FMADD would still save decode bandwidth (and
    issue bandwidth if a pair of FMUL and FADD instructions were not
    fused by idiom recognition) at the cost of supporting three-
    input operations (with reduced forwarding per unit work).

Getting bit-identical results for FMUL and FADD whether executed
separately, fused by idiom recognition, or via an ISA-extension FMADD
might have some practical benefit.

    (I still wonder if an FP execution model that only calculated to
    the integer size precision (64-bit integer for 64-bit FP),
    ignoring carry-in, might have been acceptable for 99% of uses
    and saved a little bit of power and area as well as potentially
    facilitated software FP — to implement FMUL without carry-in one
    would have to have an integer multiply high result that did not
    use carry-in (a mirror of multiply low execution) which would
    probably only be useful for multiplication by reciprocal as
    integer results are expected to be exact.)
    --- Synchronet 3.21b-Linux NewsLink 1.2