Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
It's Groundhog Day, all over again!
So, is there a way to fix this while maintaining the feature's
performance advantage?
From what is written in the article, nothing is currently known.
For new silicon, people could finally implement Mitch's suggestion
of not committing speculative state before the instruction retires.
(It would be interesting to see how much area and power this
would cost with the hundreds of instructions in flight with
modern micro-architectures).
For existing silicon - run crypto on efficiency cores, or just
make sure not to run untrusted code on your machine :-(
Thomas Koenig <tkoenig@netcologne.de> writes:
For existing silicon - run crypto on efficiency cores
Note the recent vulnerability that affects Intel's efficiency cores.
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost ...
On Mon, 25 Mar 2024 09:50:28 -0700, Stephen Fuld wrote:
On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost ...
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
On Mon, 25 Mar 2024 09:50:28 -0700, Stephen Fuld wrote:
Maybe not a good idea. But the best.
On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost ...
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really
a good idea.
Lawrence D'Oliveiro wrote:
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
Would you like to inform us of how it can be done otherwise ?
Now, personally I don't believe that for a single-user platform like the
Mac the threat is even remotely real, or that any fix is needed.
Run it in non-cacheable memory. Slow but safe.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
On Mon, 25 Mar 2024 17:07:16 GMT, Scott Lurndal wrote:
Run it in non-cacheable memory. Slow but safe.
But 99% of the performance speedups of the last 20-30 years have involved
caching of some kind.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
To eliminate this particular vulnerability, it's sufficient to disable
the data-dependent prefetcher.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
That's a pretty bad article, but at least one can read it without
JavaScript, unlike the web page of the vulnerability
<https://gofetch.fail/>.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
...
On Mon, 25 Mar 2024 17:07:16 GMT, Scott Lurndal wrote:
Run it in non-cacheable memory. Slow but safe.
Running the crypto algorithms (when not offloaded to
on-chip accelerators) using non-cacheable memory as a workaround
until the hardware issues are ameliorated doesn't imply that
all other code needs to run non-cachable.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
To eliminate this particular vulnerability, it's sufficient to disable
the data-dependent prefetcher.
That assumes that chicken bit(s) are available to do that.
Stephen Fuld wrote:
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
So, is there a way to fix this while maintaining the feature's
performance advantage?
They COULD start by not putting prefetched data into the cache
until after the predicting instruction retires. {{I have a note
from about 20 months ago where this feature was publicized and
the note indicates a potential side-channel.}}
An alternative is to notice that [*]cryption instructions are
being processed and turn DMP off during those intervals of time.
{Or both}.
Principle:: an Architecturally visible unit of data can only become
visible after the causing instruction retires. A high precision timer
makes cache line [dis]placement visible; so either take away the HPT
or don't alter cache visible state too early.
And we are off to the races, again.....
Principle:: an Architecturally visible unit of data can only become
visible after the causing instruction retires. A high precision
timer makes cache line [dis]placement visible; so either take away
the HPT or don't alter cache visible state too early.
And parallelism (e.g. multicores) can be used to emulate HPT, so "take
away the HPT" is not really an option.
Stefan
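Stefan's point can be sketched in C: a sibling thread spinning on a shared counter gives untrusted code a timestamp source even if architectural high-precision timers are withheld. (A sketch only; the names are invented, and a real attack would calibrate the counter against wall-clock time.)

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_ulong ticks;   /* shared "clock", incremented by the spinner */
static atomic_int done;

/* Spinner thread: emulates a high-precision timer by counting as fast as it can. */
static void *counter_thread(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&done, memory_order_relaxed))
        atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
    return NULL;
}

/* Time an interval with the emulated clock, as an attacker would time a probe access. */
unsigned long emulated_timer_delta(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, counter_thread, NULL) != 0)
        return 0;
    unsigned long t0 = atomic_load(&ticks);
    struct timespec ts = {0, 10 * 1000 * 1000};   /* stand-in for the access being timed */
    nanosleep(&ts, NULL);
    unsigned long t1 = atomic_load(&ticks);
    atomic_store(&done, 1);
    pthread_join(t, NULL);
    return t1 - t0;   /* elapsed "ticks": nonzero despite using no hardware timer */
}
```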
In case you missed it, the web page contains link to pdf: https://gofetch.fail/files/gofetch.pdf
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
Hm, I'm not sure it would actually be used rarely, at least not
the way I thought about it.
I envisage a "ldp" (load pointer) instruction, which turns on
prefetching, for everything that looks like
foo_t *p = some_expr;
which could also mean something like
*p = ptrarray[i];
with a scaled and indexed load (for example), where prefetching
is turned on, and a "ldd" (load double data) instruction where,
explicitly, for
long int n = some_other_expr;
prefetching is explicitly disabled. (Apart from the security
implications, this could also save a tiny bit of power).
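In C terms, the type information that would select between the proposed "ldp" and "ldd" encodings is already present at every load; a minimal sketch (foo_t and its fields are invented for illustration):

```c
#include <stddef.h>

typedef struct foo {
    struct foo *next;   /* a load of this field would be emitted as ldp: value is a pointer */
    long        val;    /* a load of this field would be emitted as ldd: value is plain data */
} foo_t;

long sum_list(const foo_t *head) {
    long acc = 0;
    for (const foo_t *p = head; p != NULL; p = p->next)  /* ldp-style load */
        acc += p->val;                                   /* ldd-style load */
    return acc;
}
```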
On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
EricP <ThatWouldBeTelling@thevillage.com> writes:
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?
- anton
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
Hm, I'm not sure it would actually be used rarely, at least not
the way I thought about it.
I'm referring to your load with prefetch disable.
For these particular loads its users could likely tolerate the
"overhead" of an extra LEA instruction to calculate the address,
and don't need all 7 integer data types.
Thomas Koenig <tkoenig@netcologne.de> writes:
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Michael S <already5chosen@yahoo.com> schrieb:
In case you missed it, the web page contains link to pdf:
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
It is worth noting (from the paper's Introduction):
In particular, Augury reported that the [Apple M-series ed.] DMP only activates
in the presence of a rather idiosyncratic program memory
access pattern (where the program streams through an array
of pointers and architecturally dereferences those pointers).
This access pattern is not typically found in security critical
software such as side-channel hardened constant-time code--
hence making that code impervious to leakage through the
DMP.
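The "idiosyncratic" pattern the paper describes is, in C, roughly a loop that streams through an array of pointers and architecturally dereferences each element (a hypothetical sketch; the names are invented):

```c
/* Stream through an array of pointers, dereferencing each element.
 * Per Augury, this is the pattern that activates the M-series DMP:
 * the values being loaded are themselves used as addresses. */
long sum_through_pointers(long *const *aop, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += *aop[i];   /* architectural dereference of a loaded pointer */
    return sum;
}
```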
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?
- anton
I'm pointing out that not all instructions need to be orthogonal.
There can be savings in opcode space by tempering that based on
expected frequency of occurrence.
The normal LD and ST have all their address modes and data types
because these functions occur frequently enough that we deem it
worthwhile to support these all in one instruction,
such as supporting both sign and zero extended loads
or scaled index addressing.
I note there is this class of relatively rarely used special purpose
memory access instructions that don't need to have all singing and all dancing address modes and/or data types like the regular LD and ST.
Since I need a LEA Load Effective Address instruction anyway
which does rBase+rIndex*scale+offset calculation
(plus I have others, like where rBase is RIP or an absolute address),
then I can drop all but the [reg] address mode for these rare instructions and in many cases drop some sign or zero extend types for loads.
For example, I use just two opcodes for Atomic Fetch Add int64 and int32
AFADD8 rDst,rSrc,[rAddr]
AFADD4 rDst,rSrc,[rAddr]
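A rough C rendering of that split, using the GCC/Clang __atomic builtins as a stand-in for AFADD8 (the separate address computation is what the LEA instruction would carry; function names are invented):

```c
#include <stdint.h>

/* AFADD8-style operation: atomic fetch-add of a 64-bit value at a bare
 * [rAddr] address; the __atomic builtin stands in for the opcode. */
int64_t afadd8(int64_t *addr, int64_t src) {
    return __atomic_fetch_add(addr, src, __ATOMIC_SEQ_CST);
}

/* The address arithmetic is done separately, as a LEA would do it. */
int64_t fetch_add_at_index(int64_t *base, int i, int64_t src) {
    int64_t *addr = &base[i];   /* LEA: rBase + rIndex*scale */
    return afadd8(addr, src);   /* the atomic op needs only the [reg] mode */
}
```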
On 3/25/2024 5:27 PM, Lawrence D'Oliveiro wrote:
On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really
a good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
But microcode is generally slower than dedicated hardware, and most
people seem to be unwilling to give up performance all the time to gain
an advantage in a situation that occurs infrequently and mostly never.
On Wed, 27 Mar 2024 09:22:12 -0700, Stephen Fuld wrote:
On 3/25/2024 5:27 PM, Lawrence D'Oliveiro wrote:
On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really
a good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
But microcode is generally slower than dedicated hardware, and most
people seem to be unwilling to give up performance all the time to gain
an advantage in a situation that occurs infrequently and mostly never.
Bruce Schneier has a saying: “attacks never get worse, they can only get better”.
EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't
matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?
- anton
I'm pointing out that not all instructions need to be orthogonal.
There can be savings in opcode space by tempering that based on
expected frequency of occurrence.
The normal LD and ST have all their address modes and data types
because these functions occur frequently enough that we deem it
worthwhile to support these all in one instruction,
such as supporting both sign and zero extended loads
or scaled index addressing.
I note there is this class of relatively rarely used special purpose
memory access instructions that don't need to have all singing and all
dancing address modes and/or data types like the regular LD and ST.
Since I need a LEA Load Effective Address instruction anyway
which does rBase+rIndex*scale+offset calculation
(plus I have others, like where rBase is RIP or an absolute address),
then I can drop all but the [reg] address mode for these rare
instructions
and in many cases drop some sign or zero extend types for loads.
It seems to me that once the core has identified that an address, plus
an offset from that address, contains another address (foo->next,
foo->prev), only those are prefetched. So this depends on placing next
as the first member of a structure, and remains dependent on chasing
next a lot more often than chasing prev.
Otherwise, knowing a loaded value contains a pointer to a structure (or
array), one cannot predict what to prefetch unless one can assume the
offset into the struct (or array).
Now Note:: If there were an instruction that loaded the value known to be
a pointer and prefetched based on the received pointer, then the prefetch
is now architectural not µArchitectural and you are allowed to damage the cache or TLB when/after the instruction retires.
MitchAlsup1 wrote:
It seems to me that once the core has identified that an address, plus
an offset from that address, contains another address (foo->next,
foo->prev), only those are prefetched. So this depends on placing next
as the first member of a structure, and remains dependent on chasing
next a lot more often than chasing prev.
Otherwise, knowing a loaded value contains a pointer to a structure (or
array), one cannot predict what to prefetch unless one can assume the
offset into the struct (or array).
Right, this is the problem that these "data memory-dependent"
prefetchers, like the one described in that "Intel Programmable and
Integrated Unified Memory Architecture (PIUMA)" paper referenced by
Paul Clayton, are trying to solve.
The pointer field to chase can be
(a) at an +- offset from the current pointer virtual address
(b) at a different offset for each iteration
(c) conditional on some other field at some other offset
and most important:
(d) any new pointers are virtual addresses that have to start back at
the Load Store Queue for VA translation and forwarding testing
after applying (a),(b) and (c) above.
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
So I find the simplistic, blithe data-dependent auto prefetching,
as described, questionable.
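Cases (b) and (c) can be made concrete with a small C sketch, in which the word holding the next pointer depends on a tag field, so hardware that blindly dereferences loaded values has no fixed offset to chase (the struct and names are invented for illustration):

```c
#include <stddef.h>

/* The pointer to chase sits at a different offset depending on `tag`:
 * case (b), a varying per-iteration offset, and case (c), conditional
 * on another field. Only software knows which word is the pointer. */
struct node {
    int          tag;
    struct node *left;
    struct node *right;
};

const struct node *step(const struct node *n) {
    return n->tag ? n->left : n->right;
}

int chain_length(const struct node *n) {
    int len = 0;
    while (n != NULL) {
        len++;
        n = step(n);
    }
    return len;
}
```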
Now Note:: If there were an instruction that loaded the value known to be
a pointer and prefetched based on the received pointer, then the prefetch
is now architectural not µArchitectural and you are allowed to damage the
cache or TLB when/after the instruction retires.
In the PIUMA case those pointers were to sparse data sets
so part of the problem was rolling over the cache, as well as
(and the PIUMA paper didn't mention this) the TLB.
After reading the PIUMA paper I had an idea for a small modification
to the PTE cache control bits to handle sparse data. The PTE's 3 CC bits
can specify the upper page table levels are cached in the TLB but
lower levels are not because they would always roll over the TLB.
However the non-TLB cached PTE's may optionally still be cached
in L1 or L2, or not at all.
This allows one to hold the top page table levels in the TLB,
the upper middle levels in L1, lower middle levels in L2,
and leaf PTE's and sparse code/data not cached at all.
BUT, as PIUMA proposes, we also allow the memory subsystem to read and
write individual aligned 8-byte values from DRAM, rather than whole
cache lines, so we only move the actual 8-byte values we need.
Also note that page table walks are also graph structure walks
but chasing physical addresses at some simple calculated offsets.
These physical addresses might be cached in L1 or L2 so we can't
just chase these pointers in the memory controller but,
if one wants to do this, have to do so in the cache controller.
EricP wrote:[snip]
(d) any new pointers are virtual addresses that have to start back at
the Load Store Queue for VA translation and forwarding testing
after applying (a),(b) and (c) above.
This is the tidbit that prevents doing prefetches at/in the DRAM controller. The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
BUT, as PIUMA proposes, we also allow the memory subsystem to read and
write individual aligned 8-byte values from DRAM, rather than whole
cache lines, so we only move the actual 8-byte values we need.
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
On 3/28/24 3:59 PM, MitchAlsup1 wrote:
EricP wrote:[snip]
(d) any new pointers are virtual address that have to start back at
the Load Store Queue for VA translation and forwarding testing
after applying (a),(b) and (c) above.
This is the tidbit that prevents doing prefetches at/in the DRAM controller.
The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
With multiple memory channels having independent memory
controllers (a reasonable design I suspect), a memory controller
may have to send the prefetch request to another memory controller
anyway.
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
The Intel research chip was special-purpose targeting
cache-unfriendly code. Reading 64 bytes when 99% of the time 56
bytes would be unused is rather wasteful (and having more memory
channels helps under high thread count).
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
"Paul A. Clayton" <paaronclayton@gmail.com> writes:[snip]
On 3/28/24 3:59 PM, MitchAlsup1 wrote:
This is the tidbit that prevents doing prefetches at/in the DRAM controller.
The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
With multiple memory channels having independent memory
controllers (a reasonable design I suspect), a memory controller
may have to send the prefetch request to another memory controller
anyway.
Which is usually handled by the LLC when the address space is
striped across multiple memory controllers.
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
The Intel research chip was special-purpose targeting
cache-unfriendly code. Reading 64 bytes when 99% of the time 56
bytes would be unused is rather wasteful (and having more memory
channels helps under high thread count).
Given the lack of both spatial and temporal locality in that
workload, one wonders if the data should be cached at all.
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
On 3/29/24 10:15 AM, Scott Lurndal wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:[snip]
On 3/28/24 3:59 PM, MitchAlsup1 wrote:
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
Yes, but if the width of the on-chip network — which is what Mitch
was referring to in transferring a cache line in one cycle — is
c.72 bytes (64 bytes for the data and 8 bytes for control
information) it seems that short messages would either have to be
grouped (increasing latency) or waste a significant fraction of
the potential bandwidth for that transfer. Compressed cache lines
would also not save bandwidth. These may not be significant
considerations, but this is an answer to "why define anything
smaller than a cache line?", i.e., seemingly reasonable
motivations may exist.
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
On 3/29/24 10:15 AM, Scott Lurndal wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:[snip]
On 3/28/24 3:59 PM, MitchAlsup1 wrote:
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
Yes, but if the width of the on-chip network — which is what Mitch
was referring to in transferring a cache line in one cycle — is
c.72 bytes (64 bytes for the data and 8 bytes for control
information) it seems that short messages would either have to be
grouped (increasing latency) or waste a significant fraction of
the potential bandwidth for that transfer. Compressed cache lines
would also not save bandwidth. These may not be significant
considerations, but this is an answer to "why define anything
smaller than a cache line?", i.e., seemingly reasonable
motivations may exist.
It's not uncommon for the bus/switch/mesh -structure- to be 512 bits
wide, which indeed will support a full cache line transfer in a single
transaction; it also supports high-volume DMA operations (either memory
to memory or device to memory).
Most of the interconnect (bus, switched or point-to-point)
implementations have an overlaying protocol (including the cache
coherency protocol) and are effectively message based, with agents
posting requests that don't need a reply and expecting a reply for
the rest.
That doesn't require that every transaction over that bus
utilize the full width of the bus.
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
but ought to be quite good
at predicting prefetches.
If a pointer is loaded, chances are
very high that it will be dereferenced.
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
but ought to be quite good
at predicting prefetches.
What makes you think programmers understand prefetches any better than exceptions ??
If a pointer is loaded, chances are
very high that it will be dereferenced.
What if the value loaded is NULL.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
On the second or third time, certainly. Hmmm... given hot/cold
splitting which is fairly standard by now, do branch predictors
take this into account?
but ought to be quite good
at predicting prefetches.
What makes you think programmers understand prefetches any better than
exceptions ??
Pointers are used in many common data structures; linked list,
trees, ... A programmer who does not know about dereferencing
pointers should be kept away from computer keyboards, preferably
at a distance of at least 3 m.
If a pointer is loaded, chances are
very high that it will be dereferenced.
What if the value loaded is NULL.
Then it should be trivially predicted that it should not be prefetched.
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a), (b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
There I mean that the programmer wrote the code::
if( no error so far )
{
then continue
}
else
{
deal with the error
}
Many times, the "deal with the error" code is never even fetched.
On the second or third time, certainly. Hmmm... given hot/cold
splitting which is fairly standard by now, do branch predictors
take this into account?
First we are talking about predicting branches at compile time and
the way the programmer writes the source code, not about the dynamic
predictions of HW.
mitchalsup@aol.com (MitchAlsup1) writes:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a), (b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
There I mean that the programmer wrote the code::
if( no error so far )
{
then continue
}
else
{
deal with the error
}
Many times, the "deal with the error" code is never even fetched.
On the second or third time, certainly. Hmmm... given hot/cold
splitting which is fairly standard by now, do branch predictors
take this into account?
First we are talking about predicting branches at compile time and
the way the programmer writes the source code, not about the dynamic
predictions of HW.
gcc provides a way to "annotate" a condition with the expected
common result:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
if (likely(bus_enable.s.enabled)) {
do something
} else {
do something else
}
This will affect the layout of the code (e.g. deferring generation
of the else clause with the result that it ends up in a different
cache line or page).
It's used in the linux kernel, and in certain cpu bound applications.
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a), (b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches), but ought to be quite good
at predicting prefetches. If a pointer is loaded, chances are
very high that it will be dereferenced.
I don't think it's that simple: prefetches only bring the data into L1
cache, so they're only useful if:
- The data is not already in L1.
- The data will be used soon (i.e. before it gets thrown away from the cache).
- The corresponding load doesn't occur right away.
In all other cases, the prefetch will be just wasted work.
It's easy for programmers to "predict" those (dependent) loads which will occur
right away, but those don't really benefit from a prefetch.
E.g. if the dependent load is done 2 cycles later, performing a prefetch
lets you start the memory access 2 cycles early, but since that access
is not in L1 it'll take more than 10 cycles, so shaving
2 cycles off isn't of great benefit.
Given that we're talking about performing a prefetch on the result of
a previous load, and loads tend to already have a fairly high latency
(3-5 cycles), "2 cycles later" really means "5-7 cycles after the
beginning of the load of that pointer". That can easily translate to 20 instructions later.
My gut feeling is that it's difficult for programmers to predict what
will happen more than 20 instructions further without looking at
detailed profiling.
Stefan
Given that it is compile time, one either predicts it is taken
(loops) or not taken (errors and initialization) and arranges
the code such that fall-through is the predicted pattern (except
for loops).
Then at run time, all these branches are predicted with the standard
predictors present in the core.
Initialization stuff is mispredicted once or twice,
error code is only mispredicted when an error occurs,
loops are mispredicted once or twice per entrance.
Also note:: With an ISA like My 66000, one can perform branching using
predication and neither predict the branch nor modify where FETCH is
fetching. Ideally, predication should deal with hard to predict branches
and all flow control where the then and else clauses are short. When
these are removed from the predictor, prediction should improve--maybe
not in the number of predictions that are correct, but in the total time
wasted on branching (including both repair and misfetching overheads).
On 4/4/24 4:09 PM, MitchAlsup1 wrote:
Thomas Koenig wrote:
[snip]
Also note:: With an ISA like My 66000, one can perform branching using
predication and neither predict the branch nor modify where FETCH is
fetching. Ideally, predication should deal with hard to predict branches
and all flow control where the then and else clauses are short. When
these are removed from the predictor, prediction should improve--maybe
not in the number of predictions that are correct, but in the total time
wasted on branching (including both repair and misfetching overheads).
Rarely-executed blocks should presumably use branches even when
short to remove the rarely-executed code from the normal
instruction stream. I would guess that exceptional actions are
typically longer/more complicated.
(Consistent timing would also be important for some real-time
tasks and for avoiding timing side channels.)
The best-performing choice would also seem to be
microarchitecture-dependent. Obviously the accuracy of branch
prediction and the cost of aliasing would matter (and
perversely mispredicting a branch can _potentially_ improve
performance, though not on My 66000, I think, because more
persistent microarchitectural state is not updated until
instruction commitment).
If the predicate value is delayed and predicated operations
wait in the scheduler for this operand and the operands of one
path are available before the predicate value, branch prediction
might allow deeper speculation.
For predictable but short branches, deeper speculation might help
more than fetch irregularity hurts.
(The predicate could be
predicted — and this prediction is needed later in the pipeline —
but not distinguishing between prediction-allowed predication
and normal predication might prevent prediction from being
implemented to avoid data-dependent timing of predication.)
(The cost of speculation can also be variable. With underutilized
resources (thermal, memory bandwidth, etc.) speculation would
generally be less expensive than with high demand on resources.)