• Interpreters and indirect-branch prediction

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 13 08:20:27 2024

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005, where indirect branches were at best
    predicted by BTBs with 2-bit counters, a "bytecode interpreter" with
    one shared indirect branch (as generated by the usual
    for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp]. With one non-shared indirect branch per VM
    instruction implementation (as in typical threaded-code
    implementations), we measured 50%-61% mispredictions for a BTB with
    2-bit counters in simulation [ertl&gregg03jilp], and similar results on
    actual CPUs.
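
    To illustrate the difference between the two dispatch techniques, here
    is a minimal C sketch (an invented two-instruction VM, not taken from
    the paper): the switch-based loop typically compiles to a single
    indirect branch that all VM instructions share, while the threaded-code
    version (using the GNU C labels-as-values extension) ends each VM
    instruction implementation with its own indirect branch.

    /* Illustrative two-instruction VM (invented for this sketch). */
    typedef enum { OP_INC, OP_HALT } op;

    /* Switch-based dispatch: the compiler typically emits ONE indirect
       branch (through the switch's jump table) that every VM instruction
       shares. */
    long run_switch(const op *ip)
    {
        long acc = 0;
        for (;;) {
            switch (*ip++) {            /* the single shared indirect branch */
            case OP_INC:  acc++; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Threaded code (GNU C labels-as-values): the bytecode is translated
       to label addresses once, and each VM instruction implementation ends
       in its own "goto *", i.e. a non-shared indirect branch. */
    long run_threaded(const op *bytecode)
    {
        static void *labels[] = { &&do_inc, &&do_halt };
        void *prog[64];                 /* no bounds checking; sketch only */
        void **ip = prog;
        long acc = 0;
        int i;

        for (i = 0; ; i++) {            /* "direct threading" translation */
            prog[i] = labels[bytecode[i]];
            if (bytecode[i] == OP_HALT)
                break;
        }
        goto **ip++;
    do_inc:
        acc++;
        goto **ip++;                    /* indirect branch owned by OP_INC */
    do_halt:
        return acc;
    }

    Both functions compute the same result; what differs is how many
    distinct indirect branches the hardware sees and what history each of
    them can accumulate.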

    We simulated 2-level history-based indirect branch predictors
    [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    Hardware designers improved indirect-branch predictors to use more
    history than BTBs with 2-bit counters already in Dothan (launched
    2004) and its descendants (in particular, the Core 2 line), judging
    by the results of <http://www.complang.tuwien.ac.at/forth/threading/>
    (look in the column "switch").

    In 2015, Rohou et al. measured the prediction accuracy in Python,
    JavaScript, and a CLI (.NET virtual machine, also known under other
    names) on Nehalem (2009), Sandy Bridge (2011), Haswell (2013), and a
    simulated ITTAGE branch predictor; unfortunately, they gave results in
    mispredictions per (kilo-)instruction rather than mispredictions per
    prediction, and they used different interpreters, so you cannot compare
    that work with our earlier work. In any case, they found good
    improvements from Nehalem to Haswell.

    Inspired by this work, I did some measurements of Gforth, disabling
    various optimizations (which reduce the branch mispredictions), and
    posted the results here; you can find the sequence of postings at
    <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
    However, I did not measure the prediction accuracy, because the idea
    was to check whether the claims made in the paper (e.g., "Modern
    predictors no longer need [superinstructions] to capture patterns in
    applications." or "[prediction accuracy] can not be considered as an
    obstacle for performance anymore.") are true.

    In the Haswell results, using an interpreter with non-shared indirect
    branches (non-shared --no-dynamic) does indeed often have a prediction
    accuracy similar to that of the interpreter with one shared indirect
    branch (shared --no-dynamic), but eliminating the sharing still
    provides a small speedup. And dynamic replication (--no-super), which
    results in one indirect branch per instance of a VM instruction in the
    interpreted program (i.e., no sharing between different instances of
    the same VM instruction), provides a good reduction in branch
    mispredictions and in execution time on several benchmarks. Given that
    "non-shared --no-dynamic" and "--no-super" actually execute the same
    number and sequence of instructions (only with replication in the
    "--no-super" case), any improvement in run time is due only to the
    improved branch prediction. Here I show only these two columns, to
    avoid confusion; look at
    <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>
    for the full data set:

    Run time in seconds:

    non-shared
    --no-dynamic   --no-super
           2.440        2.276   benchgc
           1.524        1.064   brainless
           3.416        2.288   brew
           3.236        2.232   cd16sim
           2.684        1.484   fcp
           9.848        9.280   lexex

    For mispredictions the picture was as follows:

    non-shared
    --no-dynamic     --no-super
      17,005,970     12,584,698   benchgc
     178,234,917     51,375,412   brainless
     279,851,195     47,928,843   brew
     297,000,355     70,879,605   cd16sim
     293,473,631     33,496,775   fcp
      12,267,065      8,269,104   lexex

    So the improvement in branch prediction provides speedup factors
    between 1.06 and 1.81 on these benchmarks (speedup = run time of
    "non-shared --no-dynamic" divided by run time of "--no-super", e.g.
    9.848/9.280 = 1.06 for lexex and 2.684/1.484 = 1.81 for fcp). Plus, the
    results for the interpreters with a shared indirect branch (what the
    usual way of writing a switch-based interpreter gives you) are even
    slower than for "non-shared --no-dynamic". So, to paraphrase the title
    of Rohou et al.'s paper, don't trust Rohou et al.

    However, I expect that indirect-branch predictors have improved since
    Haswell, and maybe we would now see much lower numbers of
    mispredictions even for "non-shared --no-dynamic" and for "shared
    --no-dynamic"; I would have to measure that.

    Anyway, back to Mitch Alsup's original claim: No, for a bytecode
    interpreter the prediction accuracy of an indirect-branch predictor is
    not necessarily going to be especially bad. It depends on how good
    the indirect-branch predictor is, and on the benchmark.

    @Article{ertl&gregg03jilp,
      author =   {M. Anton Ertl and David Gregg},
      title =    {The Structure and Performance of \emph{Efficient} Interpreters},
      journal =  {The Journal of Instruction-Level Parallelism},
      year =     {2003},
      volume =   {5},
      month =    nov,
      url =      {https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
      url2 =     {http://www.jilp.org/vol5/v5paper12.pdf},
      note =     {http://www.jilp.org/vol5/},
      abstract = {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated. We evaluate how
                  accurate various existing and proposed branch
                  prediction schemes are on a number of interpreters,
                  how the mispredictions affect the performance of the
                  interpreters and how two different interpreter
                  implementation techniques perform with various
                  branch predictors. We also suggest various ways in
                  which hardware designers, C compiler writers, and
                  interpreter writers can improve the performance of
                  interpreters.}
    }

    @InProceedings{rohou+15,
      author =    {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec},
      title =     {Branch Prediction and the Performance of
                   Interpreters --- Don't Trust Folklore},
      booktitle = {Code Generation and Optimization (CGO)},
      year =      {2015},
      url =       {https://hal.inria.fr/hal-01100647/document},
      annote =    {Evaluates the indirect branch predictors of Nehalem,
                   Sandy Bridge, and Haswell and the ITTAGE predictor
                   on the interpreters of Python, Spidermonkey
                   (JavaScript), and a (.NET) CLI interpreter running a
                   number of benchmarks. Haswell and ITTAGE are very
                   good at branch prediction for these benchmarks, and
                   they suggest that switch-based interpreters are good
                   enough because of that. However, my own results
                   \url{news:<2015Sep7.142507@mips.complang.tuwien.ac.at>}
                   show that shared dispatch branches still can produce
                   a significant slowdown.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Nov 13 19:36:07 2024

    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.

    With one non-shared indirect branch per VM instruction implementation
    (as in typical threaded-code implementations), we measured 50%-61%
    mispredictions for a BTB with 2-bit counters in simulation
    [ertl&gregg03jilp] and similar results on actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 13 14:49:03 2024

    On 11/13/2024 1:36 PM, MitchAlsup1 wrote:
    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual
    for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.


    I can note that in my branch predictors, I ended up with a special case
    for "predict the opposite of last time" to try to deal with branches
    which alternated between being taken and not taken (these could get
    greater than a 50% mispredict rate with the more traditional
    approaches, like the 2-bit state model; the transitory states allow the
    predictor to adapt to these cases).

    A more elaborate state machine could potentially predict simple RLE
    patterns, but designing the state-transition graph would be harder.

    So, say:
    000: Weakly Not Taken
    001: Strong Not Taken
    010: Weakly Transitory Not Taken
    011: Strong Transitory Not Taken
    100: Weakly Taken
    101: Strong Taken
    110: Weakly Transitory Taken
    111: Strong Transitory Taken

    Where the transitory states always flip-flop.

    This had done better than a 2-bit state machine (with no transitory
    states). Adding additional "strength" levels did not help with accuracy
    in past attempts (it actually seemed to make things slightly worse).
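
    As a concrete illustration, here is one possible reading of this 3-bit
    scheme in C. The encoding follows the list above (bit 2 = predicted
    direction, bit 1 = transitory, bit 0 = strong), but the exact
    transition graph is not spelled out here, so the update rules below are
    only a plausible guess rather than the actual implementation.

    #include <stdbool.h>

    #define TAKEN_BIT  0x4   /* bit 2: predicted direction (1 = taken) */
    #define TRANS_BIT  0x2   /* bit 1: transitory (flip-flop) state    */
    #define STRONG_BIT 0x1   /* bit 0: strong vs. weak                 */

    static bool predict(unsigned s) { return (s & TAKEN_BIT) != 0; }

    static unsigned update(unsigned s, bool taken)
    {
        bool correct = (predict(s) == taken);
        unsigned dir = taken ? TAKEN_BIT : 0;

        if (!(s & TRANS_BIT)) {            /* ordinary weak/strong states */
            if (correct)
                return dir | STRONG_BIT;   /* strengthen, same direction */
            if (s & STRONG_BIT)
                return s & TAKEN_BIT;      /* strong -> weak, same direction */
            /* weak and still wrong: assume alternation and enter the
               transitory state that predicts the opposite of this outcome */
            return (dir ^ TAKEN_BIT) | TRANS_BIT;
        } else {                           /* transitory states flip-flop */
            if (correct)                   /* keep alternating, strengthen */
                return (s ^ TAKEN_BIT) | STRONG_BIT;
            if (s & STRONG_BIT)            /* strong -> weak transitory,
                                              "opposite of last time" again */
                return (dir ^ TAKEN_BIT) | TRANS_BIT;
            return dir;                    /* give up: ordinary weak state
                                              for the observed direction */
        }
    }

    With these guessed rules, an alternating branch that mispredicts while
    in a weak state locks into the transitory pair and is then predicted
    correctly, while mostly-biased branches keep the usual 2-bit-counter
    hysteresis; the actual trade-offs depend on the real transition graph.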



    Could maybe augment this with a 2-bit periodic field:
    00: Period 0, 0000, 1111, 1010, 0101
    01: Period 1, 001001, 010010, 100100, -, 110110, 101101, 011011
    1z: Period 2, 00010001, ..., 11101110, ...

    Though, with the longer periods always assuming a "weak" state, the
    state encoding here also needs to be able to express the phase of the
    pattern.


    The best state-transition diagram is less obvious: a correct prediction
    likely jumps to the next stage of the predicted pattern, whereas a
    mispredict jumps to a different pattern.


    One other possibility would be to use 5 or 6 bits of state, throw it at
    a genetic algorithm, and see what it comes up with (probably using some
    variants of the existing state machine as an initial seed guess).

    Say, ranking each possible state machine based on how well it can
    predict a series of simple repetitive bit patterns of various periods.

    It is tempting to consider using a "majority of 8" mutator, since in
    this case one may want to limit the mutation rate ("majority of 3"
    might not be stable enough, majority of 5 lacks a cheap way to
    evaluate, but "majority of 8" allows representing each bit as a byte
    and using a lookup table). Well, or maybe "majority of 7".
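
    One possible reading of the "majority of 8" idea, as a C sketch; the
    genome layout (one bit per byte) follows the remark above, but whether
    the 8 inputs are parents or mutated copies of a single individual is my
    own assumption, since that is not specified:

    #include <stdint.h>

    static uint8_t maj8_lut[256];   /* byte -> majority bit of its 8 bits */

    static void init_maj8(void)
    {
        int b, j, ones;
        for (b = 0; b < 256; b++) {
            for (ones = 0, j = 0; j < 8; j++)
                ones += (b >> j) & 1;
            /* tie (4 vs. 4): arbitrarily copy input 0's bit */
            maj8_lut[b] = (ones > 4) ? 1 : (ones < 4) ? 0 : (uint8_t)(b & 1);
        }
    }

    /* result bit i = majority vote over bit i of the 8 inputs;
       genomes are stored one bit per byte, as suggested above */
    static void majority8(const uint8_t *in[8], uint8_t *out, int nbits)
    {
        int i, j;
        for (i = 0; i < nbits; i++) {
            unsigned byte = 0;
            for (j = 0; j < 8; j++)
                byte |= (unsigned)(in[j][i] & 1) << j;
            out[i] = maj8_lut[byte];
        }
    }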


    May try messing with it if anyone is interested.

    ...




    With one non-shared indirect branch per VM instruction implementation
    (as in typical threaded-code implementations), we measured 50%-61%
    mispredictions for a BTB with 2-bit counters in simulation
    [ertl&gregg03jilp] and similar results on actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors
    [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".

  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 13 20:33:19 2024

    On 11/13/2024 2:49 PM, BGB wrote:
    On 11/13/2024 1:36 PM, MitchAlsup1 wrote:
    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual
    for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.


    I can note that in my branch predictors, I ended up with a special case
    for "predict the opposite of last time" to try to deal with branches
    which alternated between being taken and not taken (these could get
    greater than a 50% mispredict rate with the more traditional
    approaches, like the 2-bit state model; the transitory states allow the
    predictor to adapt to these cases).

    A more elaborate state machine could potentially predict simple RLE patterns, but designing the state-transition graph would be harder.

    So, say:
      000: Weakly Not Taken
      001: Strong Not Taken
      010: Weakly Transitory Not Taken
      011: Strong Transitory Not Taken
      100: Weakly Taken
      101: Strong Taken
      110: Weakly Transitory Taken
      111: Strong Transitory Taken

    Where the transitory states always flip-flop.

    This had done better than a 2-bit state machine (with no transitory
    states). Adding additional "strength" levels did not help with accuracy
    in past attempts (it actually seemed to make things slightly worse).



    Could maybe augment this with a 2-bit periodic field:
      00: Period 0, 0000, 1111, 1010, 0101
      01: Period 1, 001001, 010010, 100100, -, 110110, 101101, 011011
      1z: Period 2, 00010001, ..., 11101110, ...

    Though, with the longer periods always assuming a "weak" state, the
    state encoding here also needs to be able to express the phase of the
    pattern.


    The best state-transition diagram is less obvious: a correct prediction
    likely jumps to the next stage of the predicted pattern, whereas a
    mispredict jumps to a different pattern.


    One other possibility would be to use 5 or 6 bits of state, throw it at
    a genetic algorithm, and see what it comes up with (probably using some
    variants of the existing state machine as an initial seed guess).

    Say, ranking each possible state machine based on how well it can
    predict a series of simple repetitive bit patterns of various periods.

    It is tempting to consider using a "majority of 8" mutator, since in
    this case one may want to limit the mutation rate ("majority of 3"
    might not be stable enough, majority of 5 lacks a cheap way to
    evaluate, but "majority of 8" allows representing each bit as a byte
    and using a lookup table). Well, or maybe "majority of 7".


    May try messing with it if anyone is interested.

    ...


    Well, went and messed around with trying to get a genetic algorithm to
    solve this...

    Example of the sorts of results I am getting:
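    // Each case below maps {5-bit state, last branch outcome} to
    // {next 5-bit state, predicted direction of the next branch};
    // the encoding is described in the text further down.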
    6'b00000_0: tPreExCntB=6'b00001_1;
    6'b00000_1: tPreExCntB=6'b00110_0;
    6'b00001_0: tPreExCntB=6'b00001_0;
    6'b00001_1: tPreExCntB=6'b00100_0;
    6'b00010_0: tPreExCntB=6'b00000_0;
    6'b00010_1: tPreExCntB=6'b00111_0;
    6'b00011_0: tPreExCntB=6'b00010_0;
    6'b00011_1: tPreExCntB=6'b00111_1;
    6'b00100_0: tPreExCntB=6'b00011_0;
    6'b00100_1: tPreExCntB=6'b00100_1;
    6'b00101_0: tPreExCntB=6'b11100_0;
    6'b00101_1: tPreExCntB=6'b01100_1;
    6'b00110_0: tPreExCntB=6'b01011_0;
    6'b00110_1: tPreExCntB=6'b00100_1;
    6'b00111_0: tPreExCntB=6'b00010_1;
    6'b00111_1: tPreExCntB=6'b01110_1;
    6'b01000_0: tPreExCntB=6'b11111_1;
    6'b01000_1: tPreExCntB=6'b11100_0;
    6'b01001_0: tPreExCntB=6'b11101_1;
    6'b01001_1: tPreExCntB=6'b11010_1;
    6'b01010_0: tPreExCntB=6'b10011_1;
    6'b01010_1: tPreExCntB=6'b00000_0;
    6'b01011_0: tPreExCntB=6'b00000_1;
    6'b01011_1: tPreExCntB=6'b10101_1;
    6'b01100_0: tPreExCntB=6'b11111_0;
    6'b01100_1: tPreExCntB=6'b00001_0;
    6'b01101_0: tPreExCntB=6'b00000_0;
    6'b01101_1: tPreExCntB=6'b11100_1;
    6'b01110_0: tPreExCntB=6'b10011_0;
    6'b01110_1: tPreExCntB=6'b10110_0;
    6'b01111_0: tPreExCntB=6'b00000_1;
    6'b01111_1: tPreExCntB=6'b00110_1;
    6'b10000_0: tPreExCntB=6'b10001_0;
    6'b10000_1: tPreExCntB=6'b00111_1;
    6'b10001_0: tPreExCntB=6'b01001_0;
    6'b10001_1: tPreExCntB=6'b11000_1;
    6'b10010_0: tPreExCntB=6'b01001_0;
    6'b10010_1: tPreExCntB=6'b00111_1;
    6'b10011_0: tPreExCntB=6'b00100_0;
    6'b10011_1: tPreExCntB=6'b10111_1;
    6'b10100_0: tPreExCntB=6'b00100_0;
    6'b10100_1: tPreExCntB=6'b00100_1;
    6'b10101_0: tPreExCntB=6'b00010_1;
    6'b10101_1: tPreExCntB=6'b01101_1;
    6'b10110_0: tPreExCntB=6'b00011_1;
    6'b10110_1: tPreExCntB=6'b01001_1;
    6'b10111_0: tPreExCntB=6'b00011_1;
    6'b10111_1: tPreExCntB=6'b01010_0;
    6'b11000_0: tPreExCntB=6'b00111_0;
    6'b11000_1: tPreExCntB=6'b00110_1;
    6'b11001_0: tPreExCntB=6'b01111_1;
    6'b11001_1: tPreExCntB=6'b00000_1;
    6'b11010_0: tPreExCntB=6'b00111_1;
    6'b11010_1: tPreExCntB=6'b10111_0;
    6'b11011_0: tPreExCntB=6'b11010_0;
    6'b11011_1: tPreExCntB=6'b10110_1;
    6'b11100_0: tPreExCntB=6'b10000_0;
    6'b11100_1: tPreExCntB=6'b01001_0;
    6'b11101_0: tPreExCntB=6'b00110_1;
    6'b11101_1: tPreExCntB=6'b10101_1;
    6'b11110_0: tPreExCntB=6'b00011_0;
    6'b11110_1: tPreExCntB=6'b00000_0;
    6'b11111_0: tPreExCntB=6'b00111_1;
    6'b11111_1: tPreExCntB=6'b01011_1;

    The logic is admittedly difficult to decipher from the bit pattern.


    It seems able to achieve 100% accuracy on 1..3 bit repeating patterns,
    but is unable to fully learn 4..7 bit patterns.

    Granted, longer patterns would need more working state to be able to "remember" the position in the pattern.

    In this case, there are 5 bits of state, with the final bit as the
    input/output bit (the input is the current state plus the branch
    direction; the output is the next state plus the predicted branch
    direction for the next branch).

    The general ranking process was to put the state machine in a random
    initial state, expose it to 16-48 bits from a randomly selected
    pattern, followed by 16-48 bits of the target pattern. After this, it
    is used to predict each bit of the target pattern, counting up the
    number of mispredicted bits.
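
    A minimal C sketch of that ranking step, assuming the table encoding of
    the Verilog dump above (index = {5-bit state, last outcome}, value =
    {next state, next prediction}); the warm-up and target bit arrays,
    their 16-48 bit lengths, and the seeding are left to the caller and are
    my own placeholders:

    #include <stdlib.h>

    /* table[(state << 1) | last_outcome] = (next_state << 1) | prediction,
       i.e. the same 6-bit-in / 6-bit-out mapping as the case table above */
    static int rank_one(const unsigned char table[64],
                        const unsigned char *warm, int warm_len, /* random pattern */
                        const unsigned char *pat, int pat_len)   /* target pattern */
    {
        unsigned s = rand() & 0x1f;       /* random initial 5-bit state */
        unsigned pred = 0, next;
        int i, miss = 0;

        for (i = 0; i < warm_len; i++) {  /* warm-up on the random pattern */
            next = table[(s << 1) | (warm[i] & 1)];
            s = next >> 1; pred = next & 1;
        }
        for (i = 0; i < pat_len; i++) {   /* training pass on the target pattern */
            next = table[(s << 1) | (pat[i] & 1)];
            s = next >> 1; pred = next & 1;
        }
        for (i = 0; i < pat_len; i++) {   /* scoring pass: count mispredictions */
            if (pred != (unsigned)(pat[i] & 1))
                miss++;
            next = table[(s << 1) | (pat[i] & 1)];
            s = next >> 1; pred = next & 1;
        }
        return miss;                      /* lower is fitter */
    }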



    Accuracy seems to be higher than with 3 or 4 bits of state (which can
    only learn 1- or 2-bit patterns).

    Here, 6 bits doesn't seem to have much advantage over 5 bits, but has
    the drawback of not fitting into a LUT6.

    It seems 7 bits could learn 4 and 5 bit patterns, but likely isn't worth
    the cost.


    I am not sure how well it would do at actual branch prediction, since,
    as noted, it was trained on simple repeating patterns rather than
    actual branch data.





    With one non-shared indirect branch per VM instruction implementation
    (as in typical threaded-code implementations), we measured 50%-61%
    mispredictions for a BTB with 2-bit counters in simulation
    [ertl&gregg03jilp] and similar results on actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors
    [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".

