• Reverse engineering of Intel branch predictors

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 23 18:36:16 2024
    From Newsgroup: comp.arch

    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 23 21:06:41 2024
    From Newsgroup: comp.arch

    On Wed, 23 Oct 2024 18:36:16 +0000, Thomas Koenig wrote:

    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...

    A couple of points (likely pertinent only to me)::

    In MY 66000 ISA::
    a) RET is not predicted
    b) switch() is not predicted
    c) method calls are not predicted
    d) GOT calls are not predicted

    Which pretty much gets rid of the problem.

    c+d) GOT calls and method calls use the CALX instruction which
    loads IP from memory--thus not needing prediction--and not using
    a trampoline, either.

    a) When RET is seen in the instruction stream more than 1 cycle
    in front of the DECODE point, various mechanisms are used to
    fetch instructions at the return address.

    b) switch() is the JTT instruction
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Oct 28 13:55:29 2024
    From Newsgroup: comp.arch

    In MY 66000 ISA::
    a) RET is not predicted
    b) switch() is not predicted
    c) method calls are not predicted
    d) GOT calls are not predicted

    Which pretty much gets rid of the problem.

    By "the problem" I guess you mean "indirect jumps", right?

    c+d) GOT calls and method calls use the CALX instruction which
    loads IP from memory--thus not needing prediction--and not using
    a trampoline, either.

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?

    b) switch() is the JTT instruction

    IIUC this is basically like CALX, except with a bit more work to
    generate the address from which you fetch the IP (plus a bit more work
    to generate the IP from the fetched data as well). So: same question.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Nov 1 19:04:38 2024
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...

    There are more papers on this topic. There were several papers
    on variations of Spectre. I think that there is a simple condition
    which guarantees that nothing Spectre-related affects a given
    processor: the sequence of microarchitectural operations (including
    speculative operations) should depend only on architecturally
    executed instructions. So, the processor may do wildly speculative
    things, but only base speculation on architecturally executed
    instructions. Some people try to just close a single hole at
    a time; IMO it is hopeless, there are too many possible
    variations. And weaker conditions, like "cancelling" effects
    of speculative instructions, are likely to fail.

    My impression is that it is relatively easy to modify the Intel
    scheme to depend only on architectural state. The impact of such a
    restriction on performance is not clear. In the case of the branch
    predictor itself it means delaying feedback by some number of clocks,
    which looks like a minor cost. OTOH delaying fetches from
    speculatively fetched addresses will increase latency on the
    critical path, possibly leading to significant slowdown.
    --
    Waldek Hebisch
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Nov 1 21:21:53 2024
    From Newsgroup: comp.arch

    On Fri, 1 Nov 2024 19:04:38 +0000, Waldek Hebisch wrote:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...

    There are more papers on this topic. There were several papers
    on variations of Spectre. I think that there is a simple condition
    which guarantees that nothing Spectre-related affects a given
    processor: the sequence of microarchitectural operations (including
    speculative operations) should depend only on architecturally
    executed instructions.

    The easiest way to state this succinctly is::
    No microarchitectural state can be updated until the instruction
    causing said update retires. Caches, TLBs, and all predictor
    state(s) have to be included in the above.

    So, the processor may do wildly speculative
    things, but only base speculation on architecturally executed
    instructions.

    The processor can maintain look-ahead predictor state and use it
    at decode/issue but it has to verify said state in order to retire.

    Some people try to just close single hole at
    a time, IMO it is hopeless, there are too many possible
    variations.

    So far it's been almost 10 years of closing one hole after another
    and they are still closing holes. So, I agree with the hopelessness
    of that line of attack.

    And weaker conditions, like "cancelling" effects
    of speculative instructions are likely to fail.

    These also lead to replay-storms.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Nov 5 14:26:28 2024
    From Newsgroup: comp.arch

    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform the fetches from speculatively
    fetched addresses eagerly, but only if you can ensure that these
    will leave no trace if the speculation happens to fail.

    So whether and how you can do it depends on the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.
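
    A minimal sketch of that "speculative cache"/load buffer idea, in C
    (the structure, sizes, and names here are hypothetical, not from any
    shipping design): fills made under speculation park in a side buffer
    and only update the real cache when the owning instruction retires,
    so a squash leaves the cache contents and replacement state untouched.

    #include <stdbool.h>
    #include <stdint.h>

    #define SPEC_ENTRIES 16
    #define LINE_BYTES   64

    typedef struct {
        bool     valid;
        uint32_t rob_id;            /* in-flight op that caused the fill */
        uint64_t line_addr;         /* physical line address             */
        uint8_t  data[LINE_BYTES];  /* data returned under speculation   */
    } spec_fill_t;

    static spec_fill_t spec_buf[SPEC_ENTRIES];

    /* Retire: the access is now architectural, so the fill may be
     * installed in the normal cache (displacing a victim, i.e. now it
     * is allowed to leave a trace). */
    void spec_retire(uint32_t rob_id,
                     void (*install)(uint64_t addr, const uint8_t *data))
    {
        for (int i = 0; i < SPEC_ENTRIES; i++)
            if (spec_buf[i].valid && spec_buf[i].rob_id == rob_id) {
                install(spec_buf[i].line_addr, spec_buf[i].data);
                spec_buf[i].valid = false;
            }
    }

    /* Squash: just drop the entries -- the cache and its LRU state were
     * never touched, so the failed speculation leaves no trace there. */
    void spec_squash(uint32_t rob_id)
    {
        for (int i = 0; i < SPEC_ENTRIES; i++)
            if (spec_buf[i].valid && spec_buf[i].rob_id == rob_id)
                spec_buf[i].valid = false;
    }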

    If "leave no trace" includes not slowing down other concurrent memory
    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what
    operations were performed (or what addresses were accessed, ...)
    speculatively, then you'd have to be yet more careful.

    If you're worried about an attacker that can witness the actual
    accesses to the DRAM, ...


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 5 20:17:46 2024
    From Newsgroup: comp.arch

    On Tue, 5 Nov 2024 19:26:28 +0000, Stefan Monnier wrote:

    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    Key word:: Ultimately !

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    This is also how one recovers an out-of-memory-order speculative
    memref--you quit the OoO request and replay it from AGEN. This
    allows for relaxed memory order performance while achieving
    stronger memory order when OoO that could cause order-harm
    is detected.

    So whether and how you can do it depends the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Leave no trace, AND have a plan to recover should a non-µArchitectural
    event become visible.

    If "leave no trace" includes not slowing down other concurrent memory accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    Once you include the "plan to recover" you can loosen up the
    operational memory order rules and tighten them upon discovery
    of order violations.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what operations were performed (or what addresses were accessed, ...) speculatively, then you'd have to be yet more careful.

    If you're worried about an attacker that can witness the actual
    accesses to the DRAM, ...

    This brings to mind something we found in the Mc 88120 project.
    We could take a checkpoint (write to disk) and restore later--
    but we found we had to also recover the electrical state of the DRAM
    Bus and to RAS the proper banks to get the "same timing" on later
    runs on that checkpoint.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 5 20:41:22 2024
    From Newsgroup: comp.arch

    On Mon, 28 Oct 2024 17:55:29 +0000, Stefan Monnier wrote:

    In MY 66000 ISA::
    a) RET is not predicted
    b) switch() is not predicted
    c) method calls are not predicted
    d) GOT calls are not predicted

    Which pretty much gets rid of the problem.

    By "the problem" I guess you mean "indirect jumps", right?

    Mostly, but I also changed what the jump is predicted on.

    c+d) GOT calls and method calls use the CALX instruction which
    loads IP from memory--thus not needing prediction--and not using
    a trampoline, either.

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?

    It is not that these things don't need prediction, it is that you do
    the prediction and then verify the prediction using different data.

    For example: The classical way to do dense switches is an LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number, which is known earlier; and since the
    table is read-only, you don't need to verify the target address.

    So, it is not that you don't predict, it is that the data used to
    verify the prediction is more precise and available earlier.
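
    A C stand-in for that classical dense-switch lowering (the function
    and names here are mine, only the idea is from the post): a bounds
    check, a load of a full target address from a jump table, and an
    indirect jump through it, so the only thing available to verify a
    prediction against is the loaded 64-bit address, which arrives late.

    #include <stdint.h>

    typedef void (*case_fn)(void);

    void classic_switch(uint64_t idx, const case_fn table[],
                        uint64_t ncases, case_fn default_case)
    {
        if (idx >= ncases) {        /* range check                          */
            default_case();
            return;
        }
        table[idx]();               /* LD of target address + indirect jump */
    }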

    b) switch() is the JTT instruction

    IIUC this is basically like CALX, except with a bit more work to
    generate the address from which you fetch the IP (plus a bit more work
    to generate the IP from the fetched data as well). So: same question.

    So, CALX is an LDD IP,[address] and generally LDD IP,[IP,#GOT[extern]-.]
    Since GOT is not writeable from the thread using it (my architecture
    has this requirement) and GOT is DW aligned, we can
    a) avoid the LDAlign stage of the pipeline
    b) feed the loaded value directly into FETCH
    saving 2 cycles, while

    c) for those situations where we predict through GOT, we verify the
    GOT offset (DISP) instead of the loaded GOT value (an entry point).

    And since different kinds of data are used in the prediction and
    verification, the strategies used in the above paper are not
    actual attack strategies in My 66000 implementations.
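
    For contrast, here is a rough C picture of the conventional (non-CALX)
    GOT call that this replaces: two separate steps, a load of the entry
    point from the GOT and then an indirect call through it, which is what
    a vanilla indirect-branch predictor has to guess and later verify
    against the full target address. The names are illustrative, and a
    real GOT slot is filled by the dynamic loader, not by main().

    #include <stdio.h>

    static int twice(int x) { return 2 * x; }

    /* stand-in for a GOT entry: a pointer-sized slot, read-only to the
     * thread once the loader has filled it in */
    static int (*got_slot)(int);

    static int call_via_got(int x)
    {
        int (*fp)(int) = got_slot;  /* step 1: LD the entry point from GOT */
        return fp(x);               /* step 2: indirect call through it    */
    }

    int main(void)
    {
        got_slot = twice;           /* the loader's job, done here for demo */
        printf("%d\n", call_via_got(21));
        return 0;
    }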


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Nov 8 17:54:45 2024
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    So whether and how you can do it depends the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    If "leave no trace" includes not slowing down other concurrent memory accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in
    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative
    fetches without slowing down normal execution, that would mean a
    highly overprovisioned machine.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what operations were performed (or what addresses were accessed, ...) speculatively, then you'd have to be yet more careful.

    I am mostly concerned with remote attacks. To block them it should
    be enough to ensure that the machine never goes into thermal throttling
    (I consider an adversary who is not able to directly monitor power
    or temperature, so the only thing remaining is thermal throttling and
    its effect on execution time).
    --
    Waldek Hebisch
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Nov 8 18:48:47 2024
    From Newsgroup: comp.arch

    On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively
    fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    What kind of front end µArchitecture are you assuming that makes
    this hard (at all) ??

    Seems to me that if there is an instruction buffer and you load the
    speculative instructions into it, you can speculatively execute them
    and throw them away if they were not supposed to execute. All you
    have to avoid is filling the I Cache if you were not supposed to have
    fetched them.

    Thus, not hard at all.

    So whether and how you can do it depends the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    Agreed insufficient all by itself but when combined...

    If "leave no trace" includes not slowing down other concurrent memory

    It does not.

    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in

    Here, you use the word fetch as if it were a LD instruction. Is
    that what you intended ?? {{I reserve Fetch for instruction fetches
    only}}

    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative fetches without slowing down normal execution, that would mean highly overprovisioned machine.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Nov 10 02:05:03 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively
    fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    What kind of front end µArchitecture are you assuming that makes
    this hard (at all) ??

    Seems to me that if there is an instruction buffer and you load the
    speculative instructions into it, you can speculatively execute them
    and throw them away if they were not supposed to execute. All you
    have to avoid is filling I Cache if you were not supposed to have
    fetched them.

    Thus, not hard at all.

    I assume that fetches (reads) from the cache may be a multi-clock
    activity. So, an earlier access may slow down subsequent ones. I do
    not know details, but I looked at published data about Intel
    processors and this seems to be the most plausible explanation for
    their L2 timings.

    IIUC data may go through some crossbar switch which routes it to
    the correct cache bank. It is not unusual for crossbars to have
    switching delays, that is, changing destination incurs say a one
    clock penalty.

    I assume a machine competitive on price and performance, which means
    the manufacturer will use slower blocks when they do not lower average
    speed. This may lead to slowdown of "real" activity due to
    competition with speculative actions.

    So whether and how you can do it depends on the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    Agreed insufficient all by itself but when combined...

    If "leave no trace" includes not slowing down other concurrent memory

    It does not.

    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in

    Here, you use the word fetch as if it were a LD instruction. Is
    that what you intended ?? {{I reserve Fetch for instruction fetches
    only}}

    I mean any read access, both loads and instruction fetches.

    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative
    fetches without slowing down normal execution, that would mean highly
    overprovisioned machine.
    --
    Waldek Hebisch
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 02:56:05 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 2:05:03 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively
    fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    What kind of front end µArchitecture are you assuming that makes
    this hard (at all) ??

    Seems to me that if there is an instruction buffer and you load the
    speculative instructions into it, you can speculatively execute them
    and throw them away if they were not supposed to execute. All you
    have to avoid is filling I Cache if you were not supposed to have
    fetched them.

    Thus, not hard at all.

    I assume that fetches (reads) from the cache may be multi-clock
    activity.
    So, earlier access may slow down subsequent ones. I do not know
    details, but I looked at published data about Intel processors and
    this seem to be most plausible explanation for their L2 timings.

    While access to L1 caches may end up multi-cycle, as long as you
    can start a new fetch every cycle, misses do not slow down subsequent
    accesses, just result deliveries (depending on the aggressiveness of
    the OoO architecture).

    IIUC data may go through some crossbar switch which routes it to
    correct cache bank. It is not unusual for crossbars to have
    switching delays, that is changing destination incurs say one
    clock penalty.

    Caught up in minutia::

    Basically, as long as one can fetch "enough" instructions--the
    latency of the pipeline is not-very-material. For a 6-wide GBOoO
    machine, one needs the residual instructions (not issued last
    cycle), a set of new instructions on the sequential path, and
    a set of instructions on a predicted taken path. Given that,
    one can find enough instructions to DECODE and Issue. {{Where
    each set of instructions is <say> 8 words}}

    Generally speaking we build pipelines as short as will allow for
    our target frequency, as adding pipeline stages increases power
    and decreases performance.

    I assume a machine competitive on price and performance, which means
    the manufacturer will use slower blocks when they do not lower average
    speed. This may lead to slowdown of "real" activity due to
    competition with speculative actions.

    That was a necessity at 90nm and larger, but quickly became
    irrelevant at smaller geometries. Today we have a concept of
    "dark silicon"--any area that is not receiving a clock is
    (effectively) dark, and wastes little power.

    So whether and how you can do it depends on the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    Agreed insufficient all by itself but when combined...

    If "leave no trace" includes not slowing down other concurrent memory

    It does not.

    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in

    Here, you use the word fetch as if it were a LD instruction. Is
    that what you intended ?? {{I reserve Fetch for instruction fetches
    only}}

    I mean any read access, both loads and instruction fetches.

    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative
    fetches without slowing down normal execution, that would mean highly
    overprovisioned machine.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 11 09:03:36 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    For example: The classical way to do dense switches is an LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number, which is known earlier; and since the
    table is read-only, you don't need to verify the target address.

    So, it is not that you don't predict, it is that the data used to
    verify the prediction is more precise and available earlier.

    Hmm... just wondering.

    How does this handle

    switch(a)
    {
        case 1:
        case 3:
        case 4:
        case 7:
            /* Do something */
            break;
        case 2:
        case 5:
        case 11:
            /* Do something else */
            break;
        default:
            /* Something else entirely */
    }

    Would the values 1,3,4 and 7 be considered different for
    prediction purposes?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Nov 11 17:17:47 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 9:03:36 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    For example: The classical way to do dense switches is an LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number, which is known earlier; and since the
    table is read-only, you don't need to verify the target address.

    So, it is not that you don't predict, it is that the data used to
    verify the prediction is more precise and available earlier.

    Hmm... just wondering.

    How does this handle

    switch(a)
    {
        case 1:
        case 3:
        case 4:
        case 7:
            /* Do something */
            break;
        case 2:
        case 5:
        case 11:
            /* Do something else */
            break;
        default:
            /* Something else entirely */
    }

    Would the values 1,3,4 and 7 be considered different for
    prediction purposes?

    Probably.

    You might take the target IPs and put them in an indirect prediction
    target buffer, so that by the time one has (a) one has the first
    instructions at the target label. Then, verification is by index
    comparison, not by IP-address comparison.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Nov 11 15:36:50 2024
    From Newsgroup: comp.arch

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?
    It is not that these things don't need prediction, it is that you do the prediction and then verify the prediction using different data.

    I see, so you still need something similar to a BTB for operations like
    JTT, but the delay until you can verify the prediction is shorter, which
    should presumably reduce the cost of mispredictions.

    For example: The classical way to do dense switches is a LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number (which is known earlier and since the
    table is read-only you don't need to verify the target address.

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Also in order to be able to verify the index rather than the
    target address, your prediction structure will *also* need to give you
    the predicted index?
    So, rather than a BTB which just gives you a predicted target address,
    you'll need something that returns both a target address and the
    corresponding index (and the correspondence needs to be reliable if we
    verify only the index, tho I guess you could also verify both)?


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Nov 11 15:55:01 2024
    From Newsgroup: comp.arch

    If "leave no trace" includes not slowing down other concurrent memory
    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.
    First, one needs to ensure that the CPU performing speculative
    fetch will not slown down due to say resource contention.

    AFAIK there have been attempts at using resource contention as a side
    channel, indeed, but the effectiveness depends on the scale of the impact
    of resource contention compared to the "ambient noise".

    IOW, there we get into the realm of quantifying the leaks.

    But AFAIK if you limit yourself to L1 accesses I believe it's not too
    hard to avoid all resource contentions (i.e. make sure that
    no speculative access can slow down a less speculative access).

    If you want several ("arbitrarily many") speculative fetches without
    slowing down normal execution, that would mean highly
    overprovisioned machine.

    Once you're into the higher levels of cache, I don't have a clear
    picture of how hard/easy it would be to maintain the "no
    trace" guarantee. And the higher in the hierarchy you go, the more
    visible any slowdown will be, I expect.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what
    operations were performed (or what addresses were accessed, ...)
    speculatively, then you'd have to be yet more careful.
    I am mostly concerned with remote attacks. To block them it should
    be enough to ensure that machine never goes into thermal throttling
    (I consider adversary who is not able to directly monitor power
    or temperature, so only thing remaining is thermal throttling and
    its effect on execution time).

    Thermal and power information is usually made available to non-root
    processes, so I think we should assume remote attackers have access
    to it. Maybe the granularity is coarse enough to make the leak
    impractical (but we're again in the realm of quantifying leaks, here).


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Nov 11 21:23:00 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 20:36:50 +0000, Stefan Monnier wrote:

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?
    It is not that these things don't need prediction, it is that you do the
    prediction and then verify the prediction using different data.

    I see, so you still need something similar to a BTB for operations like
    JTT, but the delay until you can verify the prediction is shorter, which should presumably reduce the cost of mispredictions.

    Instead of verifying you got the right Target address (62-bits) you
    can verify you picked the proper index from the table (8-ish bits).
    So, it is shorter in the pipeline, and fewer bits to verify (and index
    the tables with--less hashing,...)

    For example: The classical way to do dense switches is a LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number (which is known earlier and since the
    table is read-only you don't need to verify the target address.

    Hmm... but in order not to have bubbles, your prediction structure still needs to give you a predicted target address (rather than a predicted
    index number), right?

    Yes, but you use the predicted index number to find the predicted
    target IP. And then verify the index later.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Nov 11 17:10:14 2024
    From Newsgroup: comp.arch

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.

    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?
    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?
    Or do you have a dedicated "jump table cache" as part of your jump
    prediction tables? [ Even if you do, it still means your prediction
    has to first predict an index and then look it up in the table, which
    increases its latency. I don't know what kind of latency is used in
    current state of the art predictors, but IIUC any increase in latency
    can be quite costly. ]


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 12 00:15:31 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 22:10:14 +0000, Stefan Monnier wrote:

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.

    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that is needed is muxing the indexed field out of the Instruction Buffer.

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?

    My 1-wide machine fetches 4-words per cycle.
    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Sure, if the indexed field is not already present, then you have to
    go fetch it, but since the table immediately follows JTT, most of the
    time, they have already arrived by the time JTT gets to DECODE.

    Or do you have a dedicated "jump table cache" as part of your jump
    prediction tables? [ Even if you do, it still means your prediction
    has to first predict an index and then look it up in the table, which increases its latency. I don't know what kind of latency is used in
    current state of the art predictors, but IIUC any increase in latency
    can be quite costly. ]

    For the wider OoO machine, you will have something like a jump table
    cache hashed with some branch history and other data to "whiten" the
    address space so one JTT table does not alias with another.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Nov 12 14:00:02 2024
    From Newsgroup: comp.arch

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.
    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that is needed is muxing the indexed field out of Instruction Buffer.

    I guess for a small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16 bits per entry, that still spans 512B. Is your IB large
    enough for that?
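
    (Where the 512B comes from, concretely; the layout below is a
    hypothetical sketch, not any particular interpreter's: one 16-bit
    table entry per opcode of a 256-opcode bytecode.)

    #include <stdint.h>

    enum { N_OPCODES = 256 };

    /* 256 entries * 2 bytes = 512 bytes that must all be reachable
     * before the dispatch target can even be formed */
    static const int16_t jump_table[N_OPCODES];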

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?
    My 1-wide machine fetches 4-words per cycle.
    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Even with a 256B cache line width, it would take 2 cycles to get a 512B
    jump table into your IB, after which you still have to select (and
    compute, if the table is compacted) the corresponding target address,
    and only after that can you start fetching (which itself will suffer
    the L1 latency), so we're up to a 5-6 cycle bubble, no?


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 12 21:31:12 2024
    From Newsgroup: comp.arch

    On Tue, 12 Nov 2024 19:00:02 +0000, Stefan Monnier wrote:

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.
    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that is needed is muxing the indexed field out of Instruction Buffer.

    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?

    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?
    My 1-wide machine fetches 4-words per cycle.
    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Even with a 256B cache line width, it would take 2 cycles to get a 512B
    jump table into your IB, after which you still have to select (and
    compute, if the table is compacted) the corresponding target address,
    and only after that can you start fetching (which itself will suffer
    the L1 latency), so we're up to a 5-6 cycle bubble, no?

    a) line size is 512 bits or 64 bytes.
    b) yes, accessing the jump table when it is not in the IB takes
       several extra cycles
    c) JTT performs a range comparison, a memory LD [IP+Ri<<sc], and
       IP += data<<2 (a C model of this is sketched below)
    d) the cycle count should be 4-5 cycles, or a bit faster than if SW
       executed the same semantic content.

    The LD does not have to be under the shadow of the range comparison
    as the data can be thrown away if the range comparison fails and we
    go to default.
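
    A C model of the JTT semantics in (c) above. The range comparison,
    the table load at [IP + Ri<<sc], and the IP += data<<2 step are from
    the post; the struct, the 16-bit entry width, and the default
    handling are illustrative assumptions.

    #include <stdint.h>

    typedef struct {
        uintptr_t      ip;      /* address the displacements are relative to */
        uint64_t       n;       /* number of table entries                   */
        const int16_t *table;   /* read-only table that follows the JTT      */
    } jtt_model;

    uintptr_t jtt_target(const jtt_model *j, uint64_t ri, uintptr_t deflt)
    {
        if (ri >= j->n)                     /* range comparison           */
            return deflt;                   /* out of range -> default    */
        int16_t d = j->table[ri];           /* LD [IP + Ri<<sc], sc = 1   */
        return j->ip + (intptr_t)d * 4;     /* IP += data << 2            */
    }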


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Nov 12 17:21:25 2024
    From Newsgroup: comp.arch

    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    Current Intel processors do a surprisingly good job of it.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Tue Nov 12 22:21:37 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 11 Nov 2024 22:10:14 +0000, Stefan Monnier wrote:

    Hmm... but in order not to have bubbles, your prediction structure still >>>> needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.

    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that
    is needed is muxing the indexed field out of Instruction Buffer.

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?

    My 1-wide machine fetches 4-words per cycle.

    A two-wide machine only adds an AGU and can do a load and math the same
    cycle, so how much bigger is that?

    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Sure, if the indexed field is not already present, then you have to
    go fetch it, but since the table immediately follows JTT, most of the
    time, they have already arrived by the time JTT gets to DECODE.

    Or do you have a dedicated "jump table cache" as part of your jump
    prediction tables? [ Even if you do, it still means your prediction
    has to first predict an index and then look it up in the table, which
    increases its latency. I don't know what kind of latency is used in
    current state of the art predictors, but IIUC any increase in latency
    can be quite costly. ]

    For the wider OoO machine, you will have something like a jump table
    cache
    hashed with some branch history and other data to "whiten" the address
    space so one JTT table does not alias with another.


    Stefan




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Nov 13 00:32:20 2024
    From Newsgroup: comp.arch

    On Tue, 12 Nov 2024 22:21:37 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:


    My 1-wide machine fetches 4-words per cycle.

    A two wide machine only adds a AGU and can do a load and math the same
    cycle, so how much bigger is that?

    Data path--only forwarding increases.
    PARSE and DECODE grow by 2.5×
    Register file grows by 25%
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 13 07:49:33 2024
    From Newsgroup: comp.arch

    Stefan Monnier wrote:
    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    Current Intel processors do a surprisingly good job of it.

    I'm guessing this is the branch pattern recognizer at work, i.e. some
    sequences of bytecode ops are much more common than a simple random
    walk through the opcodes, and those sequences make it possible to
    predict the next bytecode jump to be performed?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Nov 13 10:53:08 2024
    From Newsgroup: comp.arch

    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.
    Current Intel processors do a surprisingly good job of it.
    I'm guessing this is the branch pattern recognizer at work, i.e some sequences of bytecode ops are much more common than a simple random walk through the opcodes, and those sequences make it possible to predict the
    next bytecode jump to be performed?

    Yes. If you consider the two levels of language at play (the CPU view
    at the level of machine language and the interpreter's view of the
    bytecode language), IIUC the branch history of the CPU view ends up
    being a usable approximation of the bytecode-level instruction pointer,
    so for those sections of your bytecode-level program which are
    sufficiently repetitive, the prediction applied by the CPU can
    correctly predict the next bytecode.
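
    For concreteness, the shape of interpreter under discussion, as a toy
    C sketch (the opcodes and encoding are made up): every bytecode is
    dispatched through the same switch, which a compiler typically lowers
    to a jump table plus a single indirect branch, so the predictor's
    global history of that branch's recent targets effectively encodes
    the recent opcode sequence, i.e. a stand-in for the bytecode-level IP.

    #include <stdint.h>
    #include <stddef.h>

    enum { OP_HALT, OP_PUSH, OP_ADD, OP_JMP /* ...plus ~200 more IRL */ };

    int64_t run(const uint8_t *code)
    {
        int64_t stack[64];
        size_t sp = 0, pc = 0;
        for (;;) {
            switch (code[pc++]) {   /* the single hot dispatch branch */
            case OP_PUSH: stack[sp++] = (int8_t)code[pc++]; break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_JMP:  pc = code[pc];                    break;
            case OP_HALT: return sp ? stack[sp - 1] : 0;
            default:      return -1;    /* unknown opcode */
            }
        }
    }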


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 13 20:54:56 2024
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    If you consider the two levels of language at play (the CPU view
    at the level of machine language and the interpreter's view of the
    bytecode language), IIUC the branch history of the CPU view ends up
    being a usable approximation of the bytecode-level instruction pointer,
    so for those sections of your bytecode-level program which are sufficiently repetitive, the prediction applied by the CPU can correctly predict the
    next bytecode.

    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    In the presence of a good predictor, it would still be an interesting
    question if a more compressed version would actually perform better.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Nov 13 16:38:19 2024
    From Newsgroup: comp.arch

    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    The "often-repeating sequence" is usually because there is a loop in the bytecode program.

    Are you suggesting that bytecode languages should shun loops, presumably
    by replacing such a loop by a dedicated bytecode instruction?
    That might be an option for those rare bytecode interpreters that are
    sophisticated enough to generate program-specific bytecode instructions,
    but those are sliding toward JIT compiler-land.

    For your run-of-the-mill bytecode interpreter which has ~200
    instructions defined manually once and for all, "often-repeating
    sequences" are the lay of the land, I think.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 14 06:54:04 2024
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    The "often-repeating sequence" is usually because there is a loop in the bytecode program.

    Are you suggesting that bytecode languages should shun loops, presumably
    by replacing such a loop by a dedicated bytecode instruction?

    Cray rulez :-)

    That might be an option for those rare bytecode interpreters that are sophisticated enough to generate program-specific bytecode instructions,
    but those are sliding toward JIT compiler-land.

    For your run-of-the-mill bytecode interpreter which has ~200
    instructions defined manually once and for all, "often-repeating
    sequence" are the lay of the land, I think.

    Dynamic, yes. For static, I think something like Mitch's VVM might
    also be interesting.
    --- Synchronet 3.20a-Linux NewsLink 1.114