• Reverse engineering of Intel branch predictors

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 23 18:36:16 2024
    From Newsgroup: comp.arch

    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 23 21:06:41 2024
    From Newsgroup: comp.arch

    On Wed, 23 Oct 2024 18:36:16 +0000, Thomas Koenig wrote:

    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...

    A couple of points (likely pertinent only to me)::

    In MY 66000 ISA::
    a) RET is not predicted
    b) switch() is not predicted
    c) method calls are not predicted
    d) GOT calls are not predicted

    Which pretty much gets rid of the problem.

    c+d) GOT calls and method calls use the CALX instruction which
    loads IP from memory--thus not needing prediction--and not using
    a trampoline, either.

    a) When RET is seen in the instruction stream more than 1 cycle
    in front of the DECODE point, various mechanisms are used to
    fetch instructions at the return address.

    b) switch() is the JTT instruction
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Oct 28 13:55:29 2024
    From Newsgroup: comp.arch

    In MY 66000 ISA::
    a) RET is not predicted
    b) switch() is not predicted
    c) method calls are not predicted
    d) GOT calls are not predicted

    Which pretty much gets rid of the problem.

    By "the problem" I guess you mean "indirect jumps", right?

    c+d) GOT calls and method calls use the CALX instruction which
    loads IP from memory--thus not needing prediction--and not using
    a trampoline, either.

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?

    b) switch() is the JTT instruction

    IIUC this is basically like CALX, except with a bit more work to
    generate the address from which you fetch the IP (plus a bit more work
    to generate the IP from the fetched data as well). So: same question.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Nov 1 19:04:38 2024
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...

    There are more papers on this topic. There were several papers
    on variations of Spectre. I think that there is a simple condition
    which guarantees that nothing Spectre-related affects a given
    processor: the sequence of microarchitectural operations (including
    speculative operations) should depend only on architecturally
    executed instructions. So, the processor may do wildly speculative
    things, but only base speculation on architecturally executed
    instructions. Some people try to just close a single hole at
    a time; IMO it is hopeless, there are too many possible
    variations. And weaker conditions, like "cancelling" effects
    of speculative instructions, are likely to fail.

    My impression is that it is relatively easy to modify the Intel
    scheme to depend only on architectural state. The impact of such a
    restriction on performance is not clear. In the case of the branch
    predictor itself it means delaying feedback by some number of clocks,
    which looks like a minor cost. OTOH delaying fetches from
    speculatively fetched addresses will increase latency on the
    critical path, possibly leading to significant slowdown.
    --
    Waldek Hebisch
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Nov 1 21:21:53 2024
    From Newsgroup: comp.arch

    On Fri, 1 Nov 2024 19:04:38 +0000, Waldek Hebisch wrote:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Seems like Intel branch predictors have been pretty completely
    reverse-engineered. The following paper promises to make for very
    interesting reading:

    https://www.usenix.org/conference/usenixsecurity24/presentation/li-luyi

    I wonder what you think of this...

    There are more papers on this topic. There were several papers
    on variations of Spectre. I think that there is a simple condition
    which guarantees that nothing Spectre-related affects a given
    processor: the sequence of microarchitectural operations (including
    speculative operations) should depend only on architecturally
    executed instructions.

    The easiest way to state this succinctly is::
    No microarchitectural state can be updated until the instruction
    causing said update retires. Caches, TLBs, and all predictor
    state(s) have to be included in the above.

    So, the processor may do wildly speculative
    things, but only base speculation on architecturally executed
    instructions.

    The processor can maintain look-ahead predictor state and use it
    at decode/issue but it has to verify said state in order to retire.

    Some people try to just close single hole at
    a time, IMO it is hopeless, there are too many possible
    variations.

    So far it's been almost 10 years of closing one hole after another
    and they are still closing holes. So, I agree with the hopelessness
    of that line of attack.

    And weaker conditions, like "cancelling" effects
    of speculative instructions are likely to fail.

    These also lead to replay-storms.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Nov 5 14:26:28 2024
    From Newsgroup: comp.arch

    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform the fetches from speculatively
    fetched addresses eagerly, but only if you can ensure that these
    will leave no trace if the speculation happens to fail.

    So whether and how you can do it depends on the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.
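
    A minimal sketch of that "speculative cache"/load buffer idea, in C
    (the structure, sizes, and names here are hypothetical, not from any
    shipping design): fills made under speculation park in a side buffer
    and only update the real cache when the owning instruction retires,
    so a squash leaves the cache contents and replacement state untouched.

    #include <stdbool.h>
    #include <stdint.h>

    #define SPEC_ENTRIES 16
    #define LINE_BYTES   64

    typedef struct {
        bool     valid;
        uint32_t rob_id;            /* in-flight op that caused the fill */
        uint64_t line_addr;         /* physical line address             */
        uint8_t  data[LINE_BYTES];  /* data returned under speculation   */
    } spec_fill_t;

    static spec_fill_t spec_buf[SPEC_ENTRIES];

    /* Retire: the access is now architectural, so the fill may be
     * installed in the normal cache (displacing a victim, i.e. now it
     * is allowed to leave a trace). */
    void spec_retire(uint32_t rob_id,
                     void (*install)(uint64_t addr, const uint8_t *data))
    {
        for (int i = 0; i < SPEC_ENTRIES; i++)
            if (spec_buf[i].valid && spec_buf[i].rob_id == rob_id) {
                install(spec_buf[i].line_addr, spec_buf[i].data);
                spec_buf[i].valid = false;
            }
    }

    /* Squash: just drop the entries -- the cache and its LRU state were
     * never touched, so the failed speculation leaves no trace there. */
    void spec_squash(uint32_t rob_id)
    {
        for (int i = 0; i < SPEC_ENTRIES; i++)
            if (spec_buf[i].valid && spec_buf[i].rob_id == rob_id)
                spec_buf[i].valid = false;
    }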

    If "leave no trace" includes not slowing down other concurrent memory
    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what
    operations were performed (or what addresses were accessed, ...)
    speculatively, then you'd have to be yet more careful.

    If you're worried about an attacker that can witness the actual
    accesses to the DRAM, ...


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 5 20:17:46 2024
    From Newsgroup: comp.arch

    On Tue, 5 Nov 2024 19:26:28 +0000, Stefan Monnier wrote:

    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    Key word:: Ultimately !

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    This is also how one recovers an out-of-memory-order speculative
    memref--you quit the OoO request and replay it from AGEN. This
    allows for relaxed memory order performance while achieving
    stronger memory order when OoO that could cause order-harm
    is detected.

    So whether and how you can do it depends the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Leave no trace, AND have a plan to recover should a non-µArchitectural
    event become visible.

    If "leave no trace" includes not slowing down other concurrent memory accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    Once you include the "plan to recover" you can loosen up the
    operational memory order rules and tighten them upon discovery
    of order violations.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what operations were performed (or what addresses were accessed, ...) speculatively, then you'd have to be yet more careful.

    If you're worried about an attacker that can witness the actual
    accesses to the DRAM, ...

    This brings to mind something we found in the Mc 88120 project.
    We could take a checkpoint (write to disk) and restore later--
    but we found we had to also recover the electrical state of the DRAM
    Bus and to RAS the proper banks to get the "same timing" on later
    runs on that checkpoint.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 5 20:41:22 2024
    From Newsgroup: comp.arch

    On Mon, 28 Oct 2024 17:55:29 +0000, Stefan Monnier wrote:

    In MY 66000 ISA::
    a) RET is not predicted
    b) switch() is not predicted
    c) method calls are not predicted
    d) GOT calls are not predicted

    Which pretty much gets rid of the problem.

    By "the problem" I guess you mean "indirect jumps", right?

    Mostly, but I also changed what the jump is predicted on.

    c+d) GOT calls and method calls use the CALX instruction which
    loads IP from memory--thus not needing prediction--and not using
    a trampoline, either.

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?

    It is not that these things don't need prediction, it is that you do
    the prediction and then verify the prediction using different data.

    For example: The classical way to do dense switches is an LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number, which is known earlier; and since the
    table is read-only, you don't need to verify the target address.

    So, it is not that you don't predict, it is that the data used to
    verify the prediction is more precise and available earlier.
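
    A C stand-in for that classical dense-switch lowering (the function
    and names here are mine, only the idea is from the post): a bounds
    check, a load of a full target address from a jump table, and an
    indirect jump through it, so the only thing available to verify a
    prediction against is the loaded 64-bit address, which arrives late.

    #include <stdint.h>

    typedef void (*case_fn)(void);

    void classic_switch(uint64_t idx, const case_fn table[],
                        uint64_t ncases, case_fn default_case)
    {
        if (idx >= ncases) {        /* range check                          */
            default_case();
            return;
        }
        table[idx]();               /* LD of target address + indirect jump */
    }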

    b) switch() is the JTT instruction

    IIUC this is basically like CALX, except with a bit more work to
    generate the address from which you fetch the IP (plus a bit more work
    to generate the IP from the fetched data as well). So: same question.

    So, CALX is an LDD IP,[address] and generally LDD IP,[IP,#GOT[extern]-.]
    Since GOT is not writeable from the thread using it (my architecture
    has this requirement) and GOT is DW aligned, we can
    a) avoid the LDAlign stage of the pipeline
    b) feed the loaded value directly into FETCH
    saving 2 cycles, while

    c) for those situations where we predict through GOT, we verify the
    GOT offset (DISP) instead of the loaded GOT value (an entry point).

    And since different kinds of data are used in the prediction and
    verification, the strategies used in the above paper are not
    actual attack strategies in My 66000 implementations.
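
    For contrast, here is a rough C picture of the conventional (non-CALX)
    GOT call that this replaces: two separate steps, a load of the entry
    point from the GOT and then an indirect call through it, which is what
    a vanilla indirect-branch predictor has to guess and later verify
    against the full target address. The names are illustrative, and a
    real GOT slot is filled by the dynamic loader, not by main().

    #include <stdio.h>

    static int twice(int x) { return 2 * x; }

    /* stand-in for a GOT entry: a pointer-sized slot, read-only to the
     * thread once the loader has filled it in */
    static int (*got_slot)(int);

    static int call_via_got(int x)
    {
        int (*fp)(int) = got_slot;  /* step 1: LD the entry point from GOT */
        return fp(x);               /* step 2: indirect call through it    */
    }

    int main(void)
    {
        got_slot = twice;           /* the loader's job, done here for demo */
        printf("%d\n", call_via_got(21));
        return 0;
    }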


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Nov 8 17:54:45 2024
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    So whether and how you can do it depends the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    If "leave no trace" includes not slowing down other concurrent memory accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in
    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative
    fetches without slowing down normal execution, that would mean a
    highly overprovisioned machine.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what operations were performed (or what addresses were accessed, ...) speculatively, then you'd have to be yet more careful.

    I am mostly concerned with remote attacks. To block them it should
    be enough to ensure that the machine never goes into thermal throttling
    (I consider an adversary who is not able to directly monitor power
    or temperature, so the only thing remaining is thermal throttling and
    its effect on execution time).
    --
    Waldek Hebisch
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Nov 8 18:48:47 2024
    From Newsgroup: comp.arch

    On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively
    fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    What kind of front end µArchitecture are you assuming that makes
    this hard (at all) ??

    Seems to me that if there is an instruction buffer and you load the
    speculative instructions into it, you can speculatively execute them
    and throw them away if they were not supposed to execute. All you
    have to avoid is filling the I Cache if you were not supposed to have
    fetched them.

    Thus, not hard at all.

    So whether and how you can do it depends the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    Agreed insufficient all by itself but when combined...

    If "leave no trace" includes not slowing down other concurrent memory

    It does not.

    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in

    Here, you use the word fetch as if it were a LD instruction. Is
    that what you intended ?? {{I reserve Fetch for instruction fetches
    only}}

    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative fetches without slowing down normal execution, that would mean highly overprovisioned machine.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Nov 10 02:05:03 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively
    fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    What kind of front end µArchitecture are you assuming that makes
    this hard (at all) ??

    Seems to me that if there is an instruction buffer and you load the
    speculative instructions into it, you can speculatively execute them
    and throw them away if they were not supposed to execute. All you
    have to avoid is filling I Cache if you were not supposed to have
    fetched them.

    Thus, not hard at all.

    I assume that fetches (reads) from the cache may be a multi-clock
    activity. So, an earlier access may slow down subsequent ones. I do
    not know details, but I looked at published data about Intel
    processors and this seems to be the most plausible explanation for
    their L2 timings.

    IIUC data may go through some crossbar switch which routes it to
    the correct cache bank. It is not unusual for crossbars to have
    switching delays, that is, changing destination incurs say a one
    clock penalty.

    I assume a machine competitive on price and performance, which means
    the manufacturer will use slower blocks when they do not lower average
    speed. This may lead to slowdown of "real" activity due to
    competition with speculative actions.

    So whether and how you can do it depends on the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    Agreed insufficient all by itself but when combined...

    If "leave no trace" includes not slowing down other concurrent memory

    It does not.

    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in

    Here, you use the word fetch as if it were a LD instruction. Is
    that what you intended ?? {{I reserve Fetch for instruction fetches
    only}}

    I mean any read access, both loads and instruction fetches.

    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative
    fetches without slowing down normal execution, that would mean highly
    overprovisioned machine.
    --
    Waldek Hebisch
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 02:56:05 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 2:05:03 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 8 Nov 2024 17:54:45 +0000, Waldek Hebisch wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    In case of branch predictor itself it means delay feedback by some
    number of clocks, which looks like minor cost.

    You can still make your next predictions based on "architectural state
    + pending predictions" if the pending predictions themselves only
    depend ultimately on the architectural state.

    OTOH delaying fetches from speculatively fetched addresses will
    increase latency on critical path, possibly leading to
    significant slowdown.

    I think you can similarly perform eagerly the fetches from speculatively
    fetched addresses but only if you can ensure that these will leave no
    trace if the speculation happens to fail.

    It looks extremely hard, if not impossible.

    What kind of front end µArchitecture are you assuming that makes
    this hard (at all) ??

    Seems to me that if there is an instruction buffer and you load the
    speculative instructions into it, you can speculatively execute them
    and throw them away if they were not supposed to execute. All you
    have to avoid is filling I Cache if you were not supposed to have
    fetched them.

    Thus, not hard at all.

    I assume that fetches (reads) from the cache may be multi-clock
    activity.
    So, earlier access may slow down subsequent ones. I do not know
    details, but I looked at published data about Intel processors and
    this seem to be most plausible explanation for their L2 timings.

    While access to L1 caches may end up multi-cycle, as long as you
    can start a new fetch every cycle, misses do not slow down subsequent
    accesses, just result deliveries (depending on the aggressiveness of
    the OoO architecture).

    IIUC data may go through some crossbar switch which routes it to
    correct cache bank. It is not unusual for crossbars to have
    switching delays, that is changing destination incurs say one
    clock penalty.

    Caught up in minutia::

    Basically, as long as one can fetch "enough" instructions--the
    latency of the pipeline is not-very-material. For a 6-wide GBOoO
    machine, one needs the residual instructions (not issued last
    cycle), a set of new instructions on the sequential path, and
    a set of instructions on a predicted taken path. Given that,
    one can find enough instructions to DECODE and Issue. {{Where
    each set of instructions is <say> 8 words}}

    Generally speaking we build pipelines as short as will allow for
    our target frequency, as adding pipeline stages increases power
    and decreases performance.

    I assume a machine competitive on price and performance, which means
    the manufacturer will use slower blocks when they do not lower average
    speed. This may lead to slowdown of "real" activity due to
    competition with speculative actions.

    That was a necessity at 90nm and larger, but quickly became
    irrelevant at smaller geometries. Today we have a concept of
    "dark silicon"--any area that is not receiving a clock is
    (effectively) dark, and wastes little power.

    So whether and how you can do it depends on the definition of "leave no
    trace". E.g. Mitch argues you can do it if you can refrain from putting
    that info into the normal cache (where it would have to displace
    something else, thus leaving a trace) and instead have to keep it in
    what we could call a "speculative cache" but would likely be just some
    sort of load buffer.

    Alone that is clearly insufficient.

    Agreed insufficient all by itself but when combined...

    If "leave no trace" includes not slowing down other concurrent memory

    It does not.

    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.

    First, one needs to ensure that the CPU performing speculative
    fetch will not slow down due to, say, resource contention. If you
    put some arbitrary limit like one or two speculative fetches in

    Here, you use the word fetch as if it were a LD instruction. Is
    that what you intended ?? {{I reserve Fetch for instruction fetches
    only}}

    I mean any read access, both loads and instruction fetches.

    flight, that is likely to be detectable by the attacker and may
    leak information. If you want several ("arbitrarily many") speculative
    fetches without slowing down normal execution, that would mean highly
    overprovisioned machine.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 11 09:03:36 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    For example: The classical way to do dense switches is an LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number, which is known earlier; and since the
    table is read-only, you don't need to verify the target address.

    So, it is not that you don't predict, it is that the data used to
    verify the prediction is more precise and available earlier.

    Hmm... just wondering.

    How does this handle

    switch(a)
    {
        case 1:
        case 3:
        case 4:
        case 7:
            /* Do something */
            break;
        case 2:
        case 5:
        case 11:
            /* Do something else */
            break;
        default:
            /* Something else entirely */
    }

    Would the values 1,3,4 and 7 be considered different for
    prediction purposes?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Nov 11 17:17:47 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 9:03:36 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    For example: The classical way to do dense switches is an LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number, which is known earlier; and since the
    table is read-only, you don't need to verify the target address.

    So, it is not that you don't predict, it is that the data used to
    verify the prediction is more precise and available earlier.

    Hmm... just wondering.

    How does this handle

    switch(a)
    {
        case 1:
        case 3:
        case 4:
        case 7:
            /* Do something */
            break;
        case 2:
        case 5:
        case 11:
            /* Do something else */
            break;
        default:
            /* Something else entirely */
    }

    Would the values 1,3,4 and 7 be considered different for
    prediction purposes?

    Probably.

    You might take the target IPs and put them in an indirect prediction
    target buffer, so that by the time one has (a) one has the first
    instructions at the target label. Then, verification is by index
    comparison, not by IP-address comparison.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Nov 11 15:36:50 2024
    From Newsgroup: comp.arch

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?
    It is not that these things don't need prediction, it is that you do the prediction and then verify the prediction using different data.

    I see, so you still need something similar to a BTB for operations like
    JTT, but the delay until you can verify the prediction is shorter, which
    should presumably reduce the cost of mispredictions.

    For example: The classical way to do dense switches is a LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number (which is known earlier and since the
    table is read-only you don't need to verify the target address.

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Also in order to be able to verify the index rather than the
    target address, your prediction structure will *also* need to give you
    the predicted index?
    So, rather than a BTB which just gives you a predicted target address,
    you'll need something that returns both a target address and the
    corresponding index (and the correspondence needs to be reliable if we
    verify only the index, tho I guess you could also verify both)?


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Nov 11 15:55:01 2024
    From Newsgroup: comp.arch

    If "leave no trace" includes not slowing down other concurrent memory
    accesses (e.g. from other CPUs), it might require some kind of
    priority scheme.
    First, one needs to ensure that the CPU performing speculative
    fetch will not slown down due to say resource contention.

    AFAIK there have been attempts at using resource contention as a side
    channel, indeed, but the effectiveness depends on the scale of the impact
    of resource contention compared to the "ambient noise".

    IOW, there we get into the realm of quantifying the leaks.

    But AFAIK if you limit yourself to L1 accesses I believe it's not too
    hard to avoid all resource contentions (i.e. make sure that
    no speculative access can slow down a less speculative access).

    If you want several ("arbitrarily many") speculative fetches without
    slowing down normal execution, that would mean highly
    overprovisioned machine.

    Once you're into the higher levels of cache, I don't have a clear
    picture of how hard/easy it would be to maintain the "no
    trace" guarantee. And the higher in the hierarchy you go, the more
    visible any slowdown will be, I expect.

    If you're worried about, say, a Spectre-like attack measuring the
    temperature or the power consumption of the CPU to try and guess what
    operations were performed (or what addresses were accessed, ...)
    speculatively, then you'd have to be yet more careful.
    I am mostly concerned with remote attacks. To block them it should
    be enough to ensure that machine never goes into thermal throttling
    (I consider adversary who is not able to directly monitor power
    or temperature, so only thing remaining is thermal throttling and
    its effect on execution time).

    Thermal and power information is usually made available to non-root
    processes, so I think we should assume remote attackers have access
    to it. Maybe the granularity is coarse enough to make the leak
    impractical (but we're again in the realm of quantifying leaks, here).


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Nov 11 21:23:00 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 20:36:50 +0000, Stefan Monnier wrote:

    I don't understand the "thus not needing prediction". Loading IP from
    memory takes time, doesn't it? Depending on your memory hierarchy and
    where the data is held, I'd say a minimum of 3 cycles and often more.
    What do you do during those cycles?
    It is not that these things don't need prediction, it is that you do the
    prediction and then verify the prediction using different data.

    I see, so you still need something similar to a BTB for operations like
    JTT, but the delay until you can verify the prediction is shorter, which should presumably reduce the cost of mispredictions.

    Instead of verifying you got the right Target address (62-bits) you
    can verify you picked the proper index from the table (8-ish bits).
    So, it is shorter in the pipeline, and fewer bits to verify (and index
    the tables with--less hashing,...)

    For example: The classical way to do dense switches is a LD of the
    target address and a jump to the target. This requires verifying the
    address of the target. Whereas if you predict as JTT does, you verify
    by matching the index number (which is known earlier and since the
    table is read-only you don't need to verify the target address.

    Hmm... but in order not to have bubbles, your prediction structure still needs to give you a predicted target address (rather than a predicted
    index number), right?

    Yes, but you use the predicted index number to find the predicted
    target IP. And then verify the index later.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Nov 11 17:10:14 2024
    From Newsgroup: comp.arch

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.

    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?
    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?
    Or do you have a dedicated "jump table cache" as part of your jump
    prediction tables? [ Even if you do, it still means your prediction
    has to first predict an index and then look it up in the table, which
    increases its latency. I don't know what kind of latency is used in
    current state of the art predictors, but IIUC any increase in latency
    can be quite costly. ]


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 12 00:15:31 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 22:10:14 +0000, Stefan Monnier wrote:

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.

    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that is needed is muxing the indexed field out of the Instruction Buffer.

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?

    My 1-wide machine fetches 4-words per cycle.
    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Sure, if the indexed field is not already present, then you have to
    go fetch it, but since the table immediately follows JTT, most of the
    time, they have already arrived by the time JTT gets to DECODE.

    Or do you have a dedicated "jump table cache" as part of your jump
    prediction tables? [ Even if you do, it still means your prediction
    has to first predict an index and then look it up in the table, which increases its latency. I don't know what kind of latency is used in
    current state of the art predictors, but IIUC any increase in latency
    can be quite costly. ]

    For the wider OoO machine, you will have something like a jump table
    cache hashed with some branch history and other data to "whiten" the
    address space so one JTT table does not alias with another.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Nov 12 14:00:02 2024
    From Newsgroup: comp.arch

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.
    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that is needed is muxing the indexed field out of Instruction Buffer.

    I guess for a small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16 bits per entry, that still spans 512B. Is your IB large
    enough for that?
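
    (Where the 512B comes from, concretely; the layout below is a
    hypothetical sketch, not any particular interpreter's: one 16-bit
    table entry per opcode of a 256-opcode bytecode.)

    #include <stdint.h>

    enum { N_OPCODES = 256 };

    /* 256 entries * 2 bytes = 512 bytes that must all be reachable
     * before the dispatch target can even be formed */
    static const int16_t jump_table[N_OPCODES];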

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?
    My 1-wide machine fetches 4-words per cycle.
    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Even with a 256B cache line width, it would take 2 cycles to get a 512B
    jump table into your IB, after which you still have to select (and
    compute, if the table is compacted) the corresponding target address,
    and only after that can you start fetching (which itself will suffer
    the L1 latency), so we're up to a 5-6 cycle bubble, no?


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Nov 12 21:31:12 2024
    From Newsgroup: comp.arch

    On Tue, 12 Nov 2024 19:00:02 +0000, Stefan Monnier wrote:

    Hmm... but in order not to have bubbles, your prediction structure still
    needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.
    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that is needed is muxing the indexed field out of Instruction Buffer.

    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?

    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?
    My 1-wide machine fetches 4-words per cycle.
    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Even with a 256B cache line width, it would take 2 cycles to get a 512B
    jump table into your IB, after which you still have to select (and
    compute, if the table is compacted) the corresponding target address,
    and only after that can you start fetching (which itself will suffer
    the L1 latency), so we're up to a 5-6 cycle bubble, no?

    a) line size is 512 bits or 64 bytes.
    b) yes, accessing the jump table when it is not in the IB takes
       several extra cycles
    c) JTT performs a range comparison, a memory LD [IP+Ri<<sc], and
       IP += data<<2 (a C model of this is sketched below)
    d) the cycle count should be 4-5 cycles, or a bit faster than if SW
       executed the same semantic content.

    The LD does not have to be under the shadow of the range comparison
    as the data can be thrown away if the range comparison fails and we
    go to default.
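
    A C model of the JTT semantics in (c) above. The range comparison,
    the table load at [IP + Ri<<sc], and the IP += data<<2 step are from
    the post; the struct, the 16-bit entry width, and the default
    handling are illustrative assumptions.

    #include <stdint.h>

    typedef struct {
        uintptr_t      ip;      /* address the displacements are relative to */
        uint64_t       n;       /* number of table entries                   */
        const int16_t *table;   /* read-only table that follows the JTT      */
    } jtt_model;

    uintptr_t jtt_target(const jtt_model *j, uint64_t ri, uintptr_t deflt)
    {
        if (ri >= j->n)                     /* range comparison           */
            return deflt;                   /* out of range -> default    */
        int16_t d = j->table[ri];           /* LD [IP + Ri<<sc], sc = 1   */
        return j->ip + (intptr_t)d * 4;     /* IP += data << 2            */
    }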


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Nov 12 17:21:25 2024
    From Newsgroup: comp.arch

    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    Current Intel processors do a surprisingly good job of it.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Tue Nov 12 22:21:37 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 11 Nov 2024 22:10:14 +0000, Stefan Monnier wrote:

    Hmm... but in order not to have bubbles, your prediction structure still >>>> needs to give you a predicted target address (rather than a predicted
    index number), right?
    Yes, but you use the predicted index number to find the predicted
    target IP.

    Hmm... but that would require fetching that info from memory.
    Can you do that without introducing bubbles?

    In many/most (dynamic) cases, they have already been fetched and all
    that
    is needed is muxing the indexed field out of Instruction Buffer.

    If you're lucky it's in the L1 Icache, but that still takes a couple
    cycles to get, doesn't it?

    My 1-wide machine fetches 4-words per cycle.

    A two-wide machine only adds an AGU and can do a load and math the same
    cycle, so how much bigger is that?

    My 6-wide machine fetches 3 ½-cache-lines per cycle.

    Sure, if the indexed field is not already present, then you have to
    go fetch it, but since the table immediately follows JTT, most of the
    time, they have already arrived by the time JTT gets to DECODE.

    Or do you have a dedicated "jump table cache" as part of your jump
    prediction tables? [ Even if you do, it still means your prediction
    has to first predict an index and then look it up in the table, which
    increases its latency. I don't know what kind of latency is used in
    current state of the art predictors, but IIUC any increase in latency
    can be quite costly. ]

    For the wider OoO machine, you will have something like a jump table
    cache
    hashed with some branch history and other data to "whiten" the address
    space so one JTT table does not alias with another.


    Stefan




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Nov 13 00:32:20 2024
    From Newsgroup: comp.arch

    On Tue, 12 Nov 2024 22:21:37 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:


    My 1-wide machine fetches 4-words per cycle.

    A two wide machine only adds a AGU and can do a load and math the same
    cycle, so how much bigger is that?

    Data path--only forwarding increases.
    PARSE and DECODE grow by 2.5×
    Register file grows by 25%
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 13 07:49:33 2024
    From Newsgroup: comp.arch

    Stefan Monnier wrote:
    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    Current Intel processors do a surprisingly good job of it.

    I'm guessing this is the branch pattern recognizer at work, i.e. some
    sequences of bytecode ops are much more common than a simple random
    walk through the opcodes, and those sequences make it possible to
    predict the next bytecode jump to be performed?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Nov 13 10:53:08 2024
    From Newsgroup: comp.arch

    I guess for small jump table that would work well, indeed, but for
    something like a bytecode interpreter, even if you can compact it to
    have only 16bit per entry, that still spans 512B. Is your IB large
    enough for that?
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.
    Current Intel processors do a surprisingly good job of it.
    I'm guessing this is the branch pattern recognizer at work, i.e some sequences of bytecode ops are much more common than a simple random walk through the opcodes, and those sequences make it possible to predict the
    next bytecode jump to be performed?

    Yes. If you consider the two levels of language at play (the CPU view
    at the level of machine language and the interpreter's view of the
    bytecode language), IIUC the branch history of the CPU view ends up
    being a usable approximation of the bytecode-level instruction pointer,
    so for those sections of your bytecode-level program which are
    sufficiently repetitive, the prediction applied by the CPU can
    correctly predict the next bytecode.
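
    For concreteness, the shape of interpreter under discussion, as a toy
    C sketch (the opcodes and encoding are made up): every bytecode is
    dispatched through the same switch, which a compiler typically lowers
    to a jump table plus a single indirect branch, so the predictor's
    global history of that branch's recent targets effectively encodes
    the recent opcode sequence, i.e. a stand-in for the bytecode-level IP.

    #include <stdint.h>
    #include <stddef.h>

    enum { OP_HALT, OP_PUSH, OP_ADD, OP_JMP /* ...plus ~200 more IRL */ };

    int64_t run(const uint8_t *code)
    {
        int64_t stack[64];
        size_t sp = 0, pc = 0;
        for (;;) {
            switch (code[pc++]) {   /* the single hot dispatch branch */
            case OP_PUSH: stack[sp++] = (int8_t)code[pc++]; break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_JMP:  pc = code[pc];                    break;
            case OP_HALT: return sp ? stack[sp - 1] : 0;
            default:      return -1;    /* unknown opcode */
            }
        }
    }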


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 13 20:54:56 2024
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    If you consider the two levels of language at play (the CPU view
    at the level of machine language and the interpreter's view of the
    bytecode language), IIUC the branch history of the CPU view ends up
    being a usable approximation of the bytecode-level instruction pointer,
    so for those sections of your bytecode-level program which are sufficiently repetitive, the prediction applied by the CPU can correctly predict the
    next bytecode.

    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    In the presence of a good predictor, it would still be an interesting
    question if a more compressed version would actually perform better.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Nov 13 16:38:19 2024
    From Newsgroup: comp.arch

    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    The "often-repeating sequence" is usually because there is a loop in the bytecode program.

    Are you suggesting that bytecode languages should shun loops, presumably
    by replacing such a loop by a dedicated bytecode instruction?
    That might be an option for those rare bytecode interpreters that are
    sophisticated enough to generate program-specific bytecode instructions,
    but those are sliding toward JIT compiler-land.

    For your run-of-the-mill bytecode interpreter which has ~200
    instructions defined manually once and for all, "often-repeating
    sequences" are the lay of the land, I think.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 14 06:54:04 2024
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    The "often-repeating sequence" is usually because there is a loop in the bytecode program.

    Are you suggesting that bytecode languages should shun loops, presumably
    by replacing such a loop by a dedicated bytecode instruction?

    Cray rulez :-)

    That might be an option for those rare bytecode interpreters that are sophisticated enough to generate program-specific bytecode instructions,
    but those are sliding toward JIT compiler-land.

    For your run-of-the-mill bytecode interpreter which has ~200
    instructions defined manually once and for all, "often-repeating
    sequence" are the lay of the land, I think.

    Dynamic, yes. For static, I think something like Mitch's VVM might
    also be interesting.
    --- Synchronet 3.20a-Linux NewsLink 1.114