Seems like Intel branch predictors have been pretty completely
reverse-engineered. The following paper promises to make for very
interesting reading:
https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi
I wonder what you think of this...
In MY 66000 ISA::
a) RET is not predicted
b) switch() is not predicted
c) method calls are not predicted
d) GOT calls are not predicted
Which pretty much gets rid of the problem.
c+d) GOT calls and method calls use the CALX instruction which
loads IP from memory--thus not needing prediction--and not using
a trampoline, either.
b) switch() is the JTT instruction
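For contrast, here is a minimal sketch (GNU C computed goto; illustrative
only, not MY 66000 code) of the conventional dense-switch lowering that
JTT replaces: load a code address from a table, then jump through a
register. It is this load-then-indirect-jump pattern that forces a
conventional core to predict the target.

/* Conventional dense-switch lowering: the table holds code addresses,
   and the final goto is an indirect jump whose target the front end
   must predict. Names are illustrative. */
int classify(int a) {
    static const void *table[] = { &&low, &&low, &&mid, &&mid, &&high };
    if (a < 0 || a > 4) goto dflt;
    goto *table[a];              /* indirect jump: target is predicted */
low:  return 1;
mid:  return 2;
high: return 3;
dflt: return 0;
}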
Thomas Koenig <tkoenig@netcologne.de> wrote:
Seems like Intel branch predictors have been pretty completely
reverse-engineered. The following paper promises to make for very
interesting reading:
https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi
I wonder what you think of this...
There are more papers on this topic. There were several papers
on variations of Spectre. I think that there is a simple condition
which guarantees that nothing Spectre-related affects a given
processor: the sequence of microarchitectural operations (including
speculative operations) should depend only on architecturally
executed instructions.
So, the processor may do wildly speculative
things, but only base speculation on architecturally executed
instructions.
Some people try to close just a single hole at
a time; IMO that is hopeless, there are too many possible
variations.
And weaker conditions, like "cancelling" the effects
of speculative instructions, are likely to fail.
In the case of the branch predictor itself it means delaying feedback
by some number of clocks, which looks like a minor cost.
OTOH delaying fetches from speculatively fetched addresses will
increase latency on the critical path, possibly leading to
significant slowdown.
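For concreteness, the condition rules out the classic Spectre-v1 gadget,
sketched below in C (the array names follow the original Spectre paper
and are used purely for illustration): on a mispredicted bounds check
the two loads execute speculatively, so a microarchitectural operation
ends up depending on instructions that were never architecturally
executed.

#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 512];
unsigned array1_size = 16;

uint8_t victim(size_t x) {
    if (x < array1_size)                 /* mispredicted for x >= 16 */
        return array2[array1[x] * 512];  /* speculative loads leave a
                                            cache footprint keyed to
                                            the byte array1[x] */
    return 0;
}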
In the case of the branch predictor itself it means delaying feedback
by some number of clocks, which looks like a minor cost.
You can still make your next predictions based on "architectural state
+ pending predictions" if the pending predictions themselves only
depend ultimately on the architectural state.
OTOH delaying fetches from speculatively fetched addresses will
increase latency on the critical path, possibly leading to
significant slowdown.
I think you can similarly eagerly perform the fetches from speculatively
fetched addresses, but only if you can ensure that these will leave no
trace if the speculation happens to fail.
So whether and how you can do it depends on the definition of "leave no
trace". E.g. Mitch argues you can do it if you can refrain from putting
that info into the normal cache (where it would have to displace
something else, thus leaving a trace) and instead have to keep it in
what we could call a "speculative cache" but would likely be just some
sort of load buffer.
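A toy software model of that load-buffer idea (purely illustrative; the
names, sizes, and the l1_install() hook are assumptions, not Mitch's
actual design): speculative fills park in a small side structure, a
retired load commits its line into the real cache, and a squashed load
just invalidates the entry, so the main cache never sees a displacement.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SPEC_ENTRIES 8
#define LINE_BYTES   64

struct spec_entry {
    bool     valid;
    uint64_t tag;                  /* line address */
    uint8_t  data[LINE_BYTES];
};

static struct spec_entry spec_buf[SPEC_ENTRIES];

void spec_fill(int slot, uint64_t tag, const uint8_t *line) {
    spec_buf[slot].valid = true;   /* fill goes to the side buffer only */
    spec_buf[slot].tag   = tag;
    memcpy(spec_buf[slot].data, line, LINE_BYTES);
}

void spec_retire(int slot) {
    /* only now may the line displace something in the real cache, e.g.
       l1_install(spec_buf[slot].tag, spec_buf[slot].data);  (hypothetical) */
    spec_buf[slot].valid = false;
}

void spec_squash(int slot) {
    spec_buf[slot].valid = false;  /* no trace left in the real cache */
}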
If "leave no trace" includes not slowing down other concurrent memory accesses (e.g. from other CPUs), it might require some kind of
priority scheme.
If you're worried about, say, a Spectre-like attack measuring the
temperature or the power consumption of the CPU to try and guess what
operations were performed (or what addresses were accessed, ...)
speculatively, then you'd have to be yet more careful.
If you're worried about an attacker that can witness the actual
accesses to the DRAM, ...
Stefan
In MY 66000 ISA::
a) RET is not predicted
b) switch() is not predicted
c) method calls are not predicted
d) GOT calls are not predicted
Which pretty much gets rid of the problem.
By "the problem" I guess you mean "indirect jumps", right?
c+d) GOT calls and method calls use the CALX instruction which
loads IP from memory--thus not needing prediction--and not using
a trampoline, either.
I don't understand the "thus not needing prediction". Loading IP from
memory takes time, doesn't it? Depending on your memory hierarchy and
where the data is held, I'd say a minimum of 3 cycles and often more.
What do you do during those cycles?
b) switch() is the JTT instruction
IIUC this is basically like CALX, except with a bit more work to
generate the address from which you fetch the IP (plus a bit more work
to generate the IP from the fetched data as well). So: same question.
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
In the case of the branch predictor itself it means delaying feedback
by some number of clocks, which looks like a minor cost.
You can still make your next predictions based on "architectural state
+ pending predictions" if the pending predictions themselves only
depend ultimately on the architectural state.
OTOH delaying fetches from speculatively fetched addresses will
increase latency on the critical path, possibly leading to
significant slowdown.
I think you can similarly eagerly perform the fetches from speculatively
fetched addresses, but only if you can ensure that these will leave no
trace if the speculation happens to fail.
It looks extremely hard, if not impossible.
So whether and how you can do it depends on the definition of "leave no
trace". E.g. Mitch argues you can do it if you can refrain from putting
that info into the normal cache (where it would have to displace
something else, thus leaving a trace) and instead have to keep it in
what we could call a "speculative cache" but would likely be just some
sort of load buffer.
That alone is clearly insufficient.
If "leave no trace" includes not slowing down other concurrent memory
accesses (e.g. from other CPUs), it might require some kind of
priority scheme.
First, one needs to ensure that the CPU performing the speculative
fetch will not slow down due to, say, resource contention. If you
put some arbitrary limit, like one or two speculative fetches in
flight, that is likely to be detectable by the attacker and may
leak information. If you want several ("arbitrarily many") speculative
fetches without slowing down normal execution, that would mean a highly
overprovisioned machine.
On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
In the case of the branch predictor itself it means delaying feedback
by some number of clocks, which looks like a minor cost.
You can still make your next predictions based on "architectural state
+ pending predictions" if the pending predictions themselves only
depend ultimately on the architectural state.
OTOH delaying fetches from speculatively fetched addresses will
increase latency on the critical path, possibly leading to
significant slowdown.
I think you can similarly eagerly perform the fetches from speculatively
fetched addresses, but only if you can ensure that these will leave no
trace if the speculation happens to fail.
It looks extremely hard, if not impossible.
What kind of front end µArchitecture are you assuming that makes
this hard (at all) ??
Seems to me that if there is an instruction buffer and you load the
speculative instructions into it, you can speculatively execute them
and throw them away if they were not supposed to execute. All you
have to avoid is filling the I Cache if you were not supposed to have
fetched them.
Thus, not hard at all.
So whether and how you can do it depends on the definition of "leave no
trace". E.g. Mitch argues you can do it if you can refrain from putting
that info into the normal cache (where it would have to displace
something else, thus leaving a trace) and instead have to keep it in
what we could call a "speculative cache" but would likely be just some
sort of load buffer.
That alone is clearly insufficient.
Agreed, insufficient all by itself, but when combined...
If "leave no trace" includes not slowing down other concurrent memory
It does not.
accesses (e.g. from other CPUs), it might require some kind of
priority scheme.
First, one needs to ensure that the CPU performing the speculative
fetch will not slow down due to, say, resource contention. If you
put some arbitrary limit, like one or two speculative fetches in
Here, you use the word fetch as if it were a LD instruction. Is
that what you intended ?? {{I reserve Fetch for instruction fetches
only}}
flight, that is likely to be detectable by the attacker and may
leak information. If you want several ("arbitrarily many") speculative
fetches without slowing down normal execution, that would mean a highly
overprovisioned machine.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
In the case of the branch predictor itself it means delaying feedback
by some number of clocks, which looks like a minor cost.
You can still make your next predictions based on "architectural state
+ pending predictions" if the pending predictions themselves only
depend ultimately on the architectural state.
OTOH delaying fetches from speculatively fetched addresses will
increase latency on the critical path, possibly leading to
significant slowdown.
I think you can similarly eagerly perform the fetches from speculatively
fetched addresses, but only if you can ensure that these will leave no
trace if the speculation happens to fail.
It looks extremely hard, if not impossible.
What kind of front end µArchitecture are you assuming that makes
this hard (at all) ??
Seems to me that if there is an instruction buffer and you load the
speculative instructions into it, you can speculatively execute them
and throw them away if they were not supposed to execute. All you
have to avoid is filling the I Cache if you were not supposed to have
fetched them.
Thus, not hard at all.
I assume that fetches (reads) from the cache may be a multi-clock
activity. So, an earlier access may slow down subsequent ones. I do
not know the details, but I looked at published data about Intel
processors and this seems to be the most plausible explanation for
their L2 timings.
IIUC data may go through some crossbar switch which routes it to the
correct cache bank. It is not unusual for crossbars to have
switching delays, that is, changing destination incurs say a one-clock
penalty.
I assume a machine competitive on price and performance, which means the
manufacturer will use slower blocks when they do not lower average
speed. This may lead to slowdown of "real" activity due to
competition with speculative actions.
So whether and how you can do it depends on the definition of "leave no
trace". E.g. Mitch argues you can do it if you can refrain from putting
that info into the normal cache (where it would have to displace
something else, thus leaving a trace) and instead have to keep it in
what we could call a "speculative cache" but would likely be just some
sort of load buffer.
That alone is clearly insufficient.
Agreed, insufficient all by itself, but when combined...
If "leave no trace" includes not slowing down other concurrent memory
It does not.
accesses (e.g. from other CPUs), it might require some kind of
priority scheme.
First, one needs to ensure that the CPU performing the speculative
fetch will not slow down due to, say, resource contention. If you
put some arbitrary limit, like one or two speculative fetches in
Here, you use the word fetch as if it were a LD instruction. Is
that what you intended ?? {{I reserve Fetch for instruction fetches
only}}
I mean any read access, both loads and instruction fetches.
flight, that is likely to be detectable by the attacker and may
leak information. If you want several ("arbitrarily many") speculative
fetches without slowing down normal execution, that would mean a highly
overprovisioned machine.
For example: The classical way to do dense switches is a LD of the
target address and a jump to the target. This requires verifying the
address of the target. Whereas if you predict as JTT does, you verify
by matching the index number (which is known earlier, and since the
table is read-only you don't need to verify the target address).
So, it is not that you don't predict, it is that the data used to
verify the prediction is more precise and available earlier.
MitchAlsup1 <mitchalsup@aol.com> wrote:
For example: The classical way to do dense switches is a LD of the
target address and a jump to the target. This requires verifying the
address of the target. Whereas if you predict as JTT does, you verify
by matching the index number (which is known earlier, and since the
table is read-only you don't need to verify the target address).
So, it is not that you don't predict, it is that the data used to
verify the prediction is more precise and available earlier.
Hmm... just wondering.
How does this handle
switch(a)
{
  case 1:
  case 3:
  case 4:
  case 7:
    /* Do something */
    break;
  case 2:
  case 5:
  case 11:
    /* Do something else */
    break;
  default:
    /* Something else entirely */
}
Would the values 1, 3, 4 and 7 be considered different for
prediction purposes?
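For concreteness, a dense table for that switch might look as follows
(a sketch of a typical lowering in GNU C, not actual compiler output,
and not a statement about JTT internals): 1, 3, 4 and 7 stay distinct
as indices, but all four entries hold the same target.

/* Dense dispatch table for a = 0..11; indices 1,3,4,7 share one body,
   2,5,11 share another, everything else falls through to the default. */
int handle(int a) {
    static const void *tbl[12] = {
        &&dflt,  &&body1, &&body2, &&body1, &&body1, &&body2,
        &&dflt,  &&body1, &&dflt,  &&dflt,  &&dflt,  &&body2,
    };
    if ((unsigned)a > 11) goto dflt;
    goto *tbl[a];
body1: /* Do something */             return 1;
body2: /* Do something else */        return 2;
dflt:  /* Something else entirely */  return 0;
}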
I don't understand the "thus not needing prediction". Loading IP from
memory takes time, doesn't it? Depending on your memory hierarchy and
where the data is held, I'd say a minimum of 3 cycles and often more.
What do you do during those cycles?
It is not that these things don't need prediction, it is that you do the
prediction and then verify the prediction using different data.
For example: The classical way to do dense switches is a LD of the
target address and a jump to the target. This requires verifying the
address of the target. Whereas if you predict as JTT does, you verify
by matching the index number (which is known earlier, and since the
table is read-only you don't need to verify the target address).
If "leave no trace" includes not slowing down other concurrent memoryFirst, one needs to ensure that the CPU performing speculative
accesses (e.g. from other CPUs), it might require some kind of
priority scheme.
fetch will not slown down due to say resource contention.
If you want several ("arbitrarily many") speculative fetches without
slowing down normal execution, that would mean highly
overprovisioned machine.
If you're worried about, say, a Spectre-like attack measuring the
temperature or the power consumption of the CPU to try and guess what
operations were performed (or what addresses were accessed, ...)
speculatively, then you'd have to be yet more careful.
I am mostly concerned with remote attacks. To block them it should
be enough to ensure that the machine never goes into thermal throttling
(I consider an adversary who is not able to directly monitor power
or temperature, so the only thing remaining is thermal throttling and
its effect on execution time).
I don't understand the "thus not needing prediction". Loading IP from
memory takes time, doesn't it? Depending on your memory hierarchy and
where the data is held, I'd say a minimum of 3 cycles and often more.
What do you do during those cycles?
It is not that these things don't need prediction, it is that you do the
prediction and then verify the prediction using different data.
I see, so you still need something similar to a BTB for operations like
JTT, but the delay until you can verify the prediction is shorter, which
should presumably reduce the cost of mispredictions.
For example: The classical way to do dense switches is a LD of the
target address and a jump to the target. This requires verifying the
address of the target. Whereas if you predict as JTT does, you verify
by matching the index number (which is known earlier, and since the
table is read-only you don't need to verify the target address).
Hmm... but in order not to have bubbles, your prediction structure still
needs to give you a predicted target address (rather than a predicted
index number), right?
Hmm... but in order not to have bubbles, your prediction structure still
needs to give you a predicted target address (rather than a predicted
index number), right?
Yes, but you use the predicted index number to find the predicted
target IP.
Hmm... but in order not to have bubbles, your prediction structure still
needs to give you a predicted target address (rather than a predicted
index number), right?
Yes, but you use the predicted index number to find the predicted
target IP.
Hmm... but that would require fetching that info from memory.
Can you do that without introducing bubbles?
If you're lucky it's in the L1 Icache, but that still takes a couple
cycles to get, doesn't it?
Or do you have a dedicated "jump table cache" as part of your jump
prediction tables? [ Even if you do, it still means your prediction
has to first predict an index and then look it up in the table, which
increases its latency. I don't know what kind of latency is used in
current state of the art predictors, but IIUC any increase in latency
can be quite costly. ]
Stefan
Hmm... but in order not to have bubbles, your prediction structure still
needs to give you a predicted target address (rather than a predicted
index number), right?
Yes, but you use the predicted index number to find the predicted
target IP.
Hmm... but that would require fetching that info from memory.
Can you do that without introducing bubbles?
In many/most (dynamic) cases, they have already been fetched and all
that is needed is muxing the indexed field out of Instruction Buffer.
If you're lucky it's in the L1 Icache, but that still takes a couple
cycles to get, doesn't it?
My 1-wide machine fetches 4 words per cycle.
My 6-wide machine fetches 3 ½-cache-lines per cycle.
Hmm... but in order not to have bubbles, your prediction structure still
needs to give you a predicted target address (rather than a predicted
index number), right?
Yes, but you use the predicted index number to find the predicted
target IP.
Hmm... but that would require fetching that info from memory.
Can you do that without introducing bubbles?
In many/most (dynamic) cases, they have already been fetched and all
that is needed is muxing the indexed field out of Instruction Buffer.
I guess for a small jump table that would work well, indeed, but for
something like a bytecode interpreter, even if you can compact it to
have only 16 bits per entry, that still spans 512B. Is your IB large
enough for that?
If you're lucky it's in the L1 Icache, but that still takes a couple
cycles to get, doesn't it?
My 1-wide machine fetches 4 words per cycle.
My 6-wide machine fetches 3 ½-cache-lines per cycle.
Even with a 256B cache line width, it would take 2 cycles to get a 512B
jump table into your IB, after which you still have to select (and
compute, if the table is compacted) the corresponding target address,
and only after that can you start fetching (which itself will suffer
the L1 latency), so we're up to a 5-6 cycle bubble, no?
Stefan
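To make the sizes concrete, here is a sketch (plain C, illustrative
names) of the compacted table being discussed: 256 opcodes times 2
bytes = 512 bytes of 16-bit offsets, each added to a base to form the
target, which is the extra select-and-compute step counted in the
bubble above.

#include <stdint.h>

extern const int16_t dispatch_off[256];  /* the 512-byte table */
extern const char    dispatch_base[];    /* code region the offsets index */

static inline const void *target_of(uint8_t opcode) {
    /* two steps before the target fetch can even start: load the
       16-bit entry, then form base + offset */
    return dispatch_base + dispatch_off[opcode];
}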
I guess for a small jump table that would work well, indeed, but for
something like a bytecode interpreter, even if you can compact it to
have only 16 bits per entry, that still spans 512B. Is your IB large
enough for that?
For something like a bytecode interpreter, the prediction accuracy
of the jump predictor is going to be epically bad.
On Mon, 11 Nov 2024 22:10:14 +0000, Stefan Monnier wrote:
Hmm... but in order not to have bubbles, your prediction structure still
needs to give you a predicted target address (rather than a predicted
index number), right?
Yes, but you use the predicted index number to find the predicted
target IP.
Hmm... but that would require fetching that info from memory.
Can you do that without introducing bubbles?
In many/most (dynamic) cases, they have already been fetched and all
that is needed is muxing the indexed field out of Instruction Buffer.
If you're lucky it's in the L1 Icache, but that still takes a couple
cycles to get, doesn't it?
My 1-wide machine fetches 4 words per cycle.
My 6-wide machine fetches 3 ½-cache-lines per cycle.
Sure, if the indexed field is not already present, then you have to
go fetch it, but since the table immediately follows JTT, most of the
time, they have already arrived by the time JTT gets to DECODE.
Or do you have a dedicated "jump table cache" as part of your jump
prediction tables? [ Even if you do, it still means your prediction
has to first predict an index and then look it up in the table, which
increases its latency. I don't know what kind of latency is used in
current state of the art predictors, but IIUC any increase in latency
can be quite costly. ]
For the wider OoO machine, you will have something like a jump table
cache, hashed with some branch history and other data to "whiten" the
address space so one JTT table does not alias with another.
MitchAlsup1 <mitchalsup@aol.com> wrote:
My 1-wide machine fetches 4 words per cycle.
A two-wide machine only adds an AGU and can do a load and math in the
same cycle, so how much bigger is that?
I guess for a small jump table that would work well, indeed, but for
something like a bytecode interpreter, even if you can compact it to
have only 16 bits per entry, that still spans 512B. Is your IB large
enough for that?
For something like a bytecode interpreter, the prediction accuracy
of the jump predictor is going to be epically bad.
Current Intel processors do a surprisingly good job of it.
For something like a bytecode interpreter, the prediction accuracy
of the jump predictor is going to be epically bad.
Current Intel processors do a surprisingly good job of it.
I'm guessing this is the branch pattern recognizer at work, i.e. some
sequences of bytecode ops are much more common than a simple random
walk through the opcodes, and those sequences make it possible to
predict the next bytecode jump to be performed?
If you consider the two levels of language at play (the CPU view
at the level of machine language and the interpreter's view of the
bytecode language), IIUC the branch history of the CPU view ends up
being a usable approximation of the bytecode-level instruction pointer,
so for those sections of your bytecode-level program which are
sufficiently repetitive, the prediction applied by the CPU can
correctly predict the next bytecode.
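A minimal sketch of the dispatch loop in question (plain C, made-up
opcodes): the single indirect branch that dispatches each opcode is the
one whose global history ends up tracking the bytecode-level IP.

#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_JMP, OP_HALT };

int run(const uint8_t *code) {
    int stack[64], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {            /* the predicted indirect jump */
        case OP_PUSH: stack[sp++] = code[pc++]; break;
        case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
        case OP_JMP:  pc = code[pc]; break;
        case OP_HALT: return stack[sp-1];
        default:      return -1;         /* unknown opcode */
        }
    }
}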
To me, that looks like the bytecode is insufficiently expressive, if
an often-repeating sequence occurs :-)
To me, that looks like the bytecode is insufficiently expressive, if
an often-repeating sequence occurs :-)
The "often-repeating sequence" is usually because there is a loop in the bytecode program.
Are you suggesting that bytecode languages should shun loops, presumably
by replacing such a loop by a dedicated bytecode instruction?
That might be an option for those rare bytecode interpreters that are sophisticated enough to generate program-specific bytecode instructions,
but those are sliding toward JIT compiler-land.
For your run-of-the-mill bytecode interpreter which has ~200
instructions defined manually once and for all, "often-repeating
sequence" are the lay of the land, I think.