• Interpreters and indirect-branch prediction

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 13 08:20:27 2024

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005, where indirect branches were at best
    predicted by BTBs with 2-bit counters, a "bytecode interpreter" with
    one shared indirect branch (as generated by the usual
    for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp]. With one non-shared indirect branch per VM
    instruction implementation (as in typical threaded-code
    implementations), we measured 50%-61% mispredictions for a BTB with
    2-bit counters in simulation [ertl&gregg03jilp], and similar results on
    actual CPUs.
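
    To illustrate the difference between the two dispatch techniques, here
    is a minimal C sketch (an invented two-instruction VM, not taken from
    the paper): the switch-based loop typically compiles to a single
    indirect branch that all VM instructions share, while the threaded-code
    version (using the GNU C labels-as-values extension) ends each VM
    instruction implementation with its own indirect branch.

    /* Illustrative two-instruction VM (invented for this sketch). */
    typedef enum { OP_INC, OP_HALT } op;

    /* Switch-based dispatch: the compiler typically emits ONE indirect
       branch (through the switch's jump table) that every VM instruction
       shares. */
    long run_switch(const op *ip)
    {
        long acc = 0;
        for (;;) {
            switch (*ip++) {            /* the single shared indirect branch */
            case OP_INC:  acc++; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Threaded code (GNU C labels-as-values): the bytecode is translated
       to label addresses once, and each VM instruction implementation ends
       in its own "goto *", i.e. a non-shared indirect branch. */
    long run_threaded(const op *bytecode)
    {
        static void *labels[] = { &&do_inc, &&do_halt };
        void *prog[64];                 /* no bounds checking; sketch only */
        void **ip = prog;
        long acc = 0;
        int i;

        for (i = 0; ; i++) {            /* "direct threading" translation */
            prog[i] = labels[bytecode[i]];
            if (bytecode[i] == OP_HALT)
                break;
        }
        goto **ip++;
    do_inc:
        acc++;
        goto **ip++;                    /* indirect branch owned by OP_INC */
    do_halt:
        return acc;
    }

    Both functions compute the same result; what differs is how many
    distinct indirect branches the hardware sees and what history each of
    them can accumulate.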

    We simulated 2-level history-based indirect branch predictors
    [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    Hardware designers improved indirect-branch predictors to use more
    history than BTBs with 2-bit counters already in Dothan (launched
    2004) and its descendants (in particular, the Core 2 line), judging
    by the results of <http://www.complang.tuwien.ac.at/forth/threading/>
    (look in the column "switch").

    In 2015, Rohou et al. measured the prediction accuracy in Python,
    JavaScript, and a CLI (.NET virtual machine, also known under other
    names) on Nehalem (2009), Sandy Bridge (2011), Haswell (2013), and a
    simulated ITTAGE branch predictor; unfortunately, they gave results in
    mispredictions per (kilo-)instruction rather than mispredictions per
    prediction, and they used different interpreters, so you cannot compare
    that work with our earlier work. In any case, they found good
    improvements from Nehalem to Haswell.

    Inspired by this work, I did some measurements of Gforth, disabling
    various optimizations (which reduce the branch mispredictions), and
    posted the results here; you can find the sequence of postings at
    <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
    However, I did not measure the prediction accuracy, because the idea
    was to check whether the claims made in the paper (e.g., "Modern
    predictors no longer need [superinstructions] to capture patterns in
    applications." or "[prediction accuracy] can not be considered as an
    obstacle for performance anymore.") are true.

    In the Haswell results, using an interpreter with non-shared indirect
    branches (non-shared --no-dynamic) does indeed often have a prediction
    accuracy similar to that of the interpreter with one shared indirect
    branch (shared --no-dynamic), but eliminating the sharing still
    provides a small speedup. And dynamic replication (--no-super), which
    results in one indirect branch per instance of a VM instruction in the
    interpreted program (i.e., no sharing between different instances of
    the same VM instruction), provides a good reduction in branch
    mispredictions and in execution time on several benchmarks. Given that
    "non-shared --no-dynamic" and "--no-super" actually execute the same
    number and sequence of instructions (only with replication in the
    "--no-super" case), any improvement in run time is due only to the
    improved branch prediction. Here I show only these two columns, to
    avoid confusion; look at
    <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>
    for the full data set:

    Run time in seconds:

    non-shared
    --no-dynamic   --no-super
           2.440        2.276   benchgc
           1.524        1.064   brainless
           3.416        2.288   brew
           3.236        2.232   cd16sim
           2.684        1.484   fcp
           9.848        9.280   lexex

    For mispredictions the picture was as follows:

    non-shared
    --no-dynamic     --no-super
      17,005,970     12,584,698   benchgc
     178,234,917     51,375,412   brainless
     279,851,195     47,928,843   brew
     297,000,355     70,879,605   cd16sim
     293,473,631     33,496,775   fcp
      12,267,065      8,269,104   lexex

    So the improvement in branch prediction provides speedup factors
    between 1.06 and 1.81 on these benchmarks (speedup = run time of
    "non-shared --no-dynamic" divided by run time of "--no-super", e.g.
    9.848/9.280 = 1.06 for lexex and 2.684/1.484 = 1.81 for fcp). Plus, the
    results for the interpreters with a shared indirect branch (what the
    usual way of writing a switch-based interpreter gives you) are even
    slower than for "non-shared --no-dynamic". So, to paraphrase the title
    of Rohou et al.'s paper, don't trust Rohou et al.

    However, I expect that indirect-branch predictors have improved since
    Haswell, and maybe we would now see much lower numbers of
    mispredictions even for "non-shared --no-dynamic" and for "shared
    --no-dynamic"; I would have to measure that.

    Anyway, back to Mitch Alsup's original claim: No, for a bytecode
    interpreter the prediction accuracy of an indirect-branch predictor is
    not necessarily going to be especially bad. It depends on how good
    the indirect-branch predictor is, and on the benchmark.

    @Article{ertl&gregg03jilp,
      author =   {M. Anton Ertl and David Gregg},
      title =    {The Structure and Performance of \emph{Efficient} Interpreters},
      journal =  {The Journal of Instruction-Level Parallelism},
      year =     {2003},
      volume =   {5},
      month =    nov,
      url =      {https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
      url2 =     {http://www.jilp.org/vol5/v5paper12.pdf},
      note =     {http://www.jilp.org/vol5/},
      abstract = {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated. We evaluate how
                  accurate various existing and proposed branch
                  prediction schemes are on a number of interpreters,
                  how the mispredictions affect the performance of the
                  interpreters and how two different interpreter
                  implementation techniques perform with various
                  branch predictors. We also suggest various ways in
                  which hardware designers, C compiler writers, and
                  interpreter writers can improve the performance of
                  interpreters.}
    }

    @InProceedings{rohou+15,
      author =    {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec},
      title =     {Branch Prediction and the Performance of
                   Interpreters --- Don't Trust Folklore},
      booktitle = {Code Generation and Optimization (CGO)},
      year =      {2015},
      url =       {https://hal.inria.fr/hal-01100647/document},
      annote =    {Evaluates the indirect branch predictors of Nehalem,
                   Sandy Bridge, and Haswell and the ITTAGE predictor
                   on the interpreters of Python, Spidermonkey
                   (JavaScript), and a (.NET) CLI interpreter running a
                   number of benchmarks. Haswell and ITTAGE are very
                   good at branch prediction for these benchmarks, and
                   they suggest that switch-based interpreters are good
                   enough because of that. However, my own results
                   \url{news:<2015Sep7.142507@mips.complang.tuwien.ac.at>}
                   show that shared dispatch branches still can produce
                   a significant slowdown.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Nov 13 19:36:07 2024

    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.

    With one non-shared indirect branch per VM instruction implementation
    (as in typical threaded-code implementations), we measured 50%-61%
    mispredictions for a BTB with 2-bit counters in simulation
    [ertl&gregg03jilp] and similar results on actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 13 14:49:03 2024

    On 11/13/2024 1:36 PM, MitchAlsup1 wrote:
    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual
    for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.


    I can note that in my branch predictors, I ended up with a special case
    for "predict the opposite of last time" to try to deal with branches
    which alternated between being taken and not taken (these could get
    greater than a 50% mispredict rate with the more traditional
    approaches, like the 2-bit state model; the transitory states allow the
    predictor to adapt to these cases).

    A more elaborate state machine could potentially predict simple RLE
    patterns, but designing the state-transition graph would be harder.

    So, say:
    000: Weakly Not Taken
    001: Strong Not Taken
    010: Weakly Transitory Not Taken
    011: Strong Transitory Not Taken
    100: Weakly Taken
    101: Strong Taken
    110: Weakly Transitory Taken
    111: Strong Transitory Taken

    Where the transitory states always flip-flop.

    This had done better than a 2-bit state machine (with no transitory
    states). Adding additional "strength" levels did not help with accuracy
    in past attempts (it actually seemed to make things slightly worse).
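
    As a concrete illustration, here is one possible reading of this 3-bit
    scheme in C. The encoding follows the list above (bit 2 = predicted
    direction, bit 1 = transitory, bit 0 = strong), but the exact
    transition graph is not spelled out here, so the update rules below are
    only a plausible guess rather than the actual implementation.

    #include <stdbool.h>

    #define TAKEN_BIT  0x4   /* bit 2: predicted direction (1 = taken) */
    #define TRANS_BIT  0x2   /* bit 1: transitory (flip-flop) state    */
    #define STRONG_BIT 0x1   /* bit 0: strong vs. weak                 */

    static bool predict(unsigned s) { return (s & TAKEN_BIT) != 0; }

    static unsigned update(unsigned s, bool taken)
    {
        bool correct = (predict(s) == taken);
        unsigned dir = taken ? TAKEN_BIT : 0;

        if (!(s & TRANS_BIT)) {            /* ordinary weak/strong states */
            if (correct)
                return dir | STRONG_BIT;   /* strengthen, same direction */
            if (s & STRONG_BIT)
                return s & TAKEN_BIT;      /* strong -> weak, same direction */
            /* weak and still wrong: assume alternation and enter the
               transitory state that predicts the opposite of this outcome */
            return (dir ^ TAKEN_BIT) | TRANS_BIT;
        } else {                           /* transitory states flip-flop */
            if (correct)                   /* keep alternating, strengthen */
                return (s ^ TAKEN_BIT) | STRONG_BIT;
            if (s & STRONG_BIT)            /* strong -> weak transitory,
                                              "opposite of last time" again */
                return (dir ^ TAKEN_BIT) | TRANS_BIT;
            return dir;                    /* give up: ordinary weak state
                                              for the observed direction */
        }
    }

    With these guessed rules, an alternating branch that mispredicts while
    in a weak state locks into the transitory pair and is then predicted
    correctly, while mostly-biased branches keep the usual 2-bit-counter
    hysteresis; the actual trade-offs depend on the real transition graph.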



    Could maybe augment this with a 2-bit periodic field:
    00: Period 0, 0000, 1111, 1010, 0101
    01: Period 1, 001001, 010010, 100100, -, 110110, 101101, 011011
    1z: Period 2, 00010001, ..., 11101110, ...

    Though, with the longer periods always assuming a "weak" state, the
    state encoding here also needs to be able to express the phase of the
    pattern.


    The best state-transition diagram is less obvious: a correct prediction
    likely jumps to the next stage of the predicted pattern, whereas a
    mispredict jumps to a different pattern.


    One other possibility would be to use 5 or 6 bits of state, throw it at
    a genetic algorithm, and see what it comes up with (probably using some
    variants of the existing state machine as an initial seed guess).

    Say, ranking each possible state machine based on how well it can
    predict a series of simple repetitive bit patterns of various periods.

    It is tempting to consider using a "majority of 8" mutator, since in
    this case one may want to limit the mutation rate ("majority of 3"
    might not be stable enough, majority of 5 lacks a cheap way to
    evaluate, but "majority of 8" allows representing each bit as a byte
    and using a lookup table). Well, or maybe "majority of 7".
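
    One possible reading of the "majority of 8" idea, as a C sketch; the
    genome layout (one bit per byte) follows the remark above, but whether
    the 8 inputs are parents or mutated copies of a single individual is my
    own assumption, since that is not specified:

    #include <stdint.h>

    static uint8_t maj8_lut[256];   /* byte -> majority bit of its 8 bits */

    static void init_maj8(void)
    {
        int b, j, ones;
        for (b = 0; b < 256; b++) {
            for (ones = 0, j = 0; j < 8; j++)
                ones += (b >> j) & 1;
            /* tie (4 vs. 4): arbitrarily copy input 0's bit */
            maj8_lut[b] = (ones > 4) ? 1 : (ones < 4) ? 0 : (uint8_t)(b & 1);
        }
    }

    /* result bit i = majority vote over bit i of the 8 inputs;
       genomes are stored one bit per byte, as suggested above */
    static void majority8(const uint8_t *in[8], uint8_t *out, int nbits)
    {
        int i, j;
        for (i = 0; i < nbits; i++) {
            unsigned byte = 0;
            for (j = 0; j < 8; j++)
                byte |= (unsigned)(in[j][i] & 1) << j;
            out[i] = maj8_lut[byte];
        }
    }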


    May try messing with it if anyone is interested.

    ...




    With one non-shared indirect branch per VM instruction implementation
    (as in typical threaded-code implementations), we measured 50%-61%
    mispredictions for a BTB with 2-bit counters in simulation
    [ertl&gregg03jilp] and similar results on actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors
    [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".

  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 13 20:33:19 2024

    On 11/13/2024 2:49 PM, BGB wrote:
    On 11/13/2024 1:36 PM, MitchAlsup1 wrote:
    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual
    for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.


    I can note that in my branch predictors, I ended up with a special case
    for "predict the opposite of last time" to try to deal with branches
    which alternated between being taken and not taken (these could get
    greater than a 50% mispredict rate with the more traditional
    approaches, like the 2-bit state model; the transitory states allow the
    predictor to adapt to these cases).

    A more elaborate state machine could potentially predict simple RLE patterns, but designing the state-transition graph would be harder.

    So, say:
      000: Weakly Not Taken
      001: Strong Not Taken
      010: Weakly Transitory Not Taken
      011: Strong Transitory Not Taken
      100: Weakly Taken
      101: Strong Taken
      110: Weakly Transitory Taken
      111: Strong Transitory Taken

    Where the transitory states always flip-flop.

    This had done better than a 2-bit state machine (with no transitory
    states). Adding additional "strength" levels did not help with accuracy
    in past attempts (it actually seemed to make things slightly worse).



    Could maybe augment this with a 2-bit periodic field:
      00: Period 0, 0000, 1111, 1010, 0101
      01: Period 1, 001001, 010010, 100100, -, 110110, 101101, 011011
      1z: Period 2, 00010001, ..., 11101110, ...

    Though, with the longer periods always assuming a "weak" state, the
    state encoding here also needs to be able to express the phase of the
    pattern.


    The best state-transition diagram is less obvious: a correct prediction
    likely jumps to the next stage of the predicted pattern, whereas a
    mispredict jumps to a different pattern.


    One other possibility would be to use 5 or 6 bits of state, throw it at
    a genetic algorithm, and see what it comes up with (probably using some
    variants of the existing state machine as an initial seed guess).

    Say, ranking each possible state machine based on how well it can
    predict a series of simple repetitive bit patterns of various periods.

    It is tempting to consider using a "majority of 8" mutator, since in
    this case one may want to limit the mutation rate ("majority of 3"
    might not be stable enough, majority of 5 lacks a cheap way to
    evaluate, but "majority of 8" allows representing each bit as a byte
    and using a lookup table). Well, or maybe "majority of 7".


    May try messing with it if anyone is interested.

    ...


    Well, went and messed around with trying to get a genetic algorithm to
    solve this...

    Example of the sorts of results I am getting:
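    // Each case below maps {5-bit state, last branch outcome} to
    // {next 5-bit state, predicted direction of the next branch};
    // the encoding is described in the text further down.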
    6'b00000_0: tPreExCntB=6'b00001_1;
    6'b00000_1: tPreExCntB=6'b00110_0;
    6'b00001_0: tPreExCntB=6'b00001_0;
    6'b00001_1: tPreExCntB=6'b00100_0;
    6'b00010_0: tPreExCntB=6'b00000_0;
    6'b00010_1: tPreExCntB=6'b00111_0;
    6'b00011_0: tPreExCntB=6'b00010_0;
    6'b00011_1: tPreExCntB=6'b00111_1;
    6'b00100_0: tPreExCntB=6'b00011_0;
    6'b00100_1: tPreExCntB=6'b00100_1;
    6'b00101_0: tPreExCntB=6'b11100_0;
    6'b00101_1: tPreExCntB=6'b01100_1;
    6'b00110_0: tPreExCntB=6'b01011_0;
    6'b00110_1: tPreExCntB=6'b00100_1;
    6'b00111_0: tPreExCntB=6'b00010_1;
    6'b00111_1: tPreExCntB=6'b01110_1;
    6'b01000_0: tPreExCntB=6'b11111_1;
    6'b01000_1: tPreExCntB=6'b11100_0;
    6'b01001_0: tPreExCntB=6'b11101_1;
    6'b01001_1: tPreExCntB=6'b11010_1;
    6'b01010_0: tPreExCntB=6'b10011_1;
    6'b01010_1: tPreExCntB=6'b00000_0;
    6'b01011_0: tPreExCntB=6'b00000_1;
    6'b01011_1: tPreExCntB=6'b10101_1;
    6'b01100_0: tPreExCntB=6'b11111_0;
    6'b01100_1: tPreExCntB=6'b00001_0;
    6'b01101_0: tPreExCntB=6'b00000_0;
    6'b01101_1: tPreExCntB=6'b11100_1;
    6'b01110_0: tPreExCntB=6'b10011_0;
    6'b01110_1: tPreExCntB=6'b10110_0;
    6'b01111_0: tPreExCntB=6'b00000_1;
    6'b01111_1: tPreExCntB=6'b00110_1;
    6'b10000_0: tPreExCntB=6'b10001_0;
    6'b10000_1: tPreExCntB=6'b00111_1;
    6'b10001_0: tPreExCntB=6'b01001_0;
    6'b10001_1: tPreExCntB=6'b11000_1;
    6'b10010_0: tPreExCntB=6'b01001_0;
    6'b10010_1: tPreExCntB=6'b00111_1;
    6'b10011_0: tPreExCntB=6'b00100_0;
    6'b10011_1: tPreExCntB=6'b10111_1;
    6'b10100_0: tPreExCntB=6'b00100_0;
    6'b10100_1: tPreExCntB=6'b00100_1;
    6'b10101_0: tPreExCntB=6'b00010_1;
    6'b10101_1: tPreExCntB=6'b01101_1;
    6'b10110_0: tPreExCntB=6'b00011_1;
    6'b10110_1: tPreExCntB=6'b01001_1;
    6'b10111_0: tPreExCntB=6'b00011_1;
    6'b10111_1: tPreExCntB=6'b01010_0;
    6'b11000_0: tPreExCntB=6'b00111_0;
    6'b11000_1: tPreExCntB=6'b00110_1;
    6'b11001_0: tPreExCntB=6'b01111_1;
    6'b11001_1: tPreExCntB=6'b00000_1;
    6'b11010_0: tPreExCntB=6'b00111_1;
    6'b11010_1: tPreExCntB=6'b10111_0;
    6'b11011_0: tPreExCntB=6'b11010_0;
    6'b11011_1: tPreExCntB=6'b10110_1;
    6'b11100_0: tPreExCntB=6'b10000_0;
    6'b11100_1: tPreExCntB=6'b01001_0;
    6'b11101_0: tPreExCntB=6'b00110_1;
    6'b11101_1: tPreExCntB=6'b10101_1;
    6'b11110_0: tPreExCntB=6'b00011_0;
    6'b11110_1: tPreExCntB=6'b00000_0;
    6'b11111_0: tPreExCntB=6'b00111_1;
    6'b11111_1: tPreExCntB=6'b01011_1;

    The logic is admittedly difficult to decipher from the bit pattern.


    It seems able to achieve 100% accuracy on 1..3 bit repeating patterns,
    but is unable to fully learn 4..7 bit patterns.

    Granted, longer patterns would need more working state to be able to "remember" the position in the pattern.

    In this case, there are 5 bits of state, with the final bit as the
    input/output bit (the input is the current state plus the branch
    direction; the output is the next state plus the predicted branch
    direction for the next branch).

    The general ranking process was to put the state machine in a random
    initial state, expose it to 16-48 bits from a randomly selected
    pattern, followed by 16-48 bits of the target pattern. After this, it
    is used to predict each bit of the target pattern, counting up the
    number of mispredicted bits.
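
    A minimal C sketch of that ranking step, assuming the table encoding of
    the Verilog dump above (index = {5-bit state, last outcome}, value =
    {next state, next prediction}); the warm-up and target bit arrays,
    their 16-48 bit lengths, and the seeding are left to the caller and are
    my own placeholders:

    #include <stdlib.h>

    /* table[(state << 1) | last_outcome] = (next_state << 1) | prediction,
       i.e. the same 6-bit-in / 6-bit-out mapping as the case table above */
    static int rank_one(const unsigned char table[64],
                        const unsigned char *warm, int warm_len, /* random pattern */
                        const unsigned char *pat, int pat_len)   /* target pattern */
    {
        unsigned s = rand() & 0x1f;       /* random initial 5-bit state */
        unsigned pred = 0, next;
        int i, miss = 0;

        for (i = 0; i < warm_len; i++) {  /* warm-up on the random pattern */
            next = table[(s << 1) | (warm[i] & 1)];
            s = next >> 1; pred = next & 1;
        }
        for (i = 0; i < pat_len; i++) {   /* training pass on the target pattern */
            next = table[(s << 1) | (pat[i] & 1)];
            s = next >> 1; pred = next & 1;
        }
        for (i = 0; i < pat_len; i++) {   /* scoring pass: count mispredictions */
            if (pred != (unsigned)(pat[i] & 1))
                miss++;
            next = table[(s << 1) | (pat[i] & 1)];
            s = next >> 1; pred = next & 1;
        }
        return miss;                      /* lower is fitter */
    }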



    Accuracy seems to be higher than with 3 or 4 bits of state (which can
    only learn 1- or 2-bit patterns).

    Here, 6 bits doesn't seem to have much advantage over 5 bits, but has
    the drawback of not fitting into a LUT6.

    It seems 7 bits could learn 4 and 5 bit patterns, but likely isn't worth
    the cost.


    I am not sure how well it would do at actual branch prediction, since,
    as noted, it was trained on simple repeating patterns rather than
    actual branch data.





    With one non-shared indirect branch per VM instruction implementation
    (as in typical threaded-code implementations), we measured 50%-61%
    mispredictions for a BTB with 2-bit counters in simulation
    [ertl&gregg03jilp] and similar results on actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors
    [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".

