So if we were to implement a spinlock using the above instructions,
something along the lines of:
.L0
ldaxr -- load lockword exclusive w/ acquire membar
cmp -- compare to zero
bne .L0 -- loop if currently locked
stxr -- store 1
cbnz .L0 -- retry if stxr failed
The "lock" operation has memory order acquire semantics, and we see that
in part in the ldaxr, but the store isn't part of that. We could append
an additional acquire memory barrier, but would that be necessary?
Loads from the locked critical region could move forward of the stxr,
but there's a control dependency from the cbnz branch instruction, so
they would be speculative loads until the loop exited.
You'd still potentially have loads before the store of the lockword,
but in this case that's not a problem, since it's known the lockword
was 0 and no stores from prior locked code could occur.
This should be analogous to RMW atomics like CAS, but I've no idea what
the internal hardware implementations are. Though on platforms without
CAS, the C11 atomics are implemented with LL/SC logic.
Is this sort of what's going on, or is the explicit acquire memory
barrier still needed?
Joe Seigh
On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:
My guess is that so few of us understand ARM fence
mechanics that we cannot address the question asked.
jseigh <jseigh_es00@xemaps.com> wrote:
So if we were to implement a spinlock using the above instructions,
something along the lines of:
.L0
ldaxr -- load lockword exclusive w/ acquire membar
cmp -- compare to zero
bne .L0 -- loop if currently locked
stxr -- store 1
cbnz .L0 -- retry if stxr failed
The "lock" operation has memory order acquire semantics and
we see that in part in the ldaxr but the store isn't part
of that. We could append an additional acquire memory barrier
but would that be necessary.
After the store exclusive, you mean? No, it would not be necessary.
This should be analogous to rmw atomics like CAS but
I've no idea what the internal hardware implementations
are. Though on platforms without CAS the C11 atomics
are implemented with LD/SC logic.
Is this sort of what's going on or is the explicit
acquire memory barrier still needed?
All of the implementations of things like POSIX mutexes I've seen on
AArch64 use acquire alone.
On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:
So if we were to implement a spinlock using the above instructions,
something along the lines of:
.L0
ldaxr -- load lockword exclusive w/ acquire membar
cmp -- compare to zero
bne .L0 -- loop if currently locked
stxr -- store 1
cbnz .L0 -- retry if stxr failed
The closest I could find to this was on page 8367
of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:
Loop
LDAXR W5, [X1] ; read lock with acquire
CBNZ W5, Loop ; check if 0
STXR W5, W0, [X1] ; attempt to store new value
CBNZ W5, Loop ; test if store succeeded and retry if not
On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
jseigh <jseigh_es00@xemaps.com> wrote:
The "lock" operation has memory order acquire semantics and
we see that in part in the ldaxr but the store isn't part
of that. We could append an additional acquire memory barrier
but would that be necessary.
After the store exclusive, you mean? No, it would not be necessary.
Ahhhh! I just learned something about ARM right here. I am so used to
the acquire membar being placed _after_ the atomic logic that locks the spinlock.
.L0
ldaxr -- load lockword exclusive w/ acquire membar
cmp -- compare to zero
bne .L0 -- loop if currently locked
stxr -- store 1
cbnz .L0 -- retry if stxr failed
So this acts just like a SPARC style:
atomically_lock_spinlock();
membar #LoadStore | #LoadLoad
right?
This should be analogous to rmw atomics like CAS but
I've no idea what the internal hardware implementations
are. Though on platforms without CAS the C11 atomics
are implemented with LD/SC logic.
Is this sort of what's going on or is the explicit
acquire memory barrier still needed?
All of the implementations of things like POSIX mutexes I've seen on
AArch64 use acquire alone.
On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
jseigh <jseigh_es00@xemaps.com> wrote:
Fwiw, I am basically asking if the "store" stxr has implied acquire
semantics wrt the "load" ldaxr? I am guessing that it does... This would
imply that the acquire membar (#LoadStore | #LoadLoad) would be
respected by the store at stxr wrt its "attached?" load wrt ldaxr?
Is this basically right? Or, what am I missing here? Thanks.
The membar logic wrt acquire needs to occur _after_ the atomic logic
that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
needs to occur _before_ the atomic logic that unlocks said spinlock.
Am I missing anything wrt ARM? ;^o
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
The membar logic wrt acquire needs to occur _after_ the atomic logic
that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
needs to occur _before_ the atomic logic that unlocks said spinlock.
Am I missing anything wrt ARM? ;^o
Did you read the extensive description of memory semantics
in the ARMv8 ARM? See page 275 in DDI0487K_a.
https://developer.arm.com/documentation/ddi0487/ka/?lang=en
On 11/8/2024 2:45 PM, Scott Lurndal wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
Did you read the extensive description of memory semantics
in the ARMv8 ARM? See page 275 in DDI0487K_a.
https://developer.arm.com/documentation/ddi0487/ka/?lang=en
I did not! So I am flying mostly blind here. I don't really have any
experience with how ARM handles these types of things. Just guessing
that the store would honor the acquire of the load? Or does the store
need a membar and the load does not need acquire at all? I know that
the membar should be after the final store that actually locks the
spinlock wrt Joe's example.
I just need to RTFM!!!!
Sorry about that Scott. ;^o
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers; looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
EricP <ThatWouldBeTelling@thevillage.com> writes:
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
On 11/8/24 17:56, Chris M. Thomasson wrote:
I did not! So I am flying a mostly blind here. I don't really have any
experience with how ARM handles these types of things. Just guessing
that the store would honor the acquire of the load? Or, does the store
need a membar and the load does not need acquire at all? I know that
the membar should be after the final store that actually locks the
spinlock wrt Joe's example.
I just need to RTFM!!!!
Sorry about that Scott. ;^o
Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
In this case the stxr doesn't need a memory barrier. Loads can move
forward of it, but not forward of the ldaxr, because it has acquire
semantics. For a lock that's ok, since the stxr would fail if any
other thread acquired the lock; the conditional branch would make the
loads speculative if the stxr failed, I believe.
Joe Seigh
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
The closest I could find to this was on page 8367
of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:
DDI0487K_a is the most recent.
Loop
LDAXR W5, [X1] ; read lock with acquire
CBNZ W5, Loop ; check if 0
STXR W5, W0, [X1] ; attempt to store new value
CBNZ W5, Loop ; test if store succeeded and retry if not
A real world example from the linux kernel:
static __always_inline s64
__ll_sc_atomic64_dec_if_positive(atomic64_t *v)
{
s64 result;
unsigned long tmp;
asm volatile("// atomic64_dec_if_positive\n"
" prfm pstl1strm, %2\n"
"1: ldxr %0, %2\n"
" subs %0, %0, #1\n"
" b.lt 2f\n"
" stlxr %w1, %0, %2\n"
" cbnz %w1, 1b\n"
" dmb ish\n"
"2:"
: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
:
: "cc", "memory");
return result;
}
On 11/8/2024 6:19 AM, Scott Lurndal wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
" dmb ish\n"
"dmb ish" is interesting to me for some reason...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
" dmb ish\n"
"dmb ish" is interesting to me for some reason...
Data Memory Barrier - inner shareable coherency domain
It reads better without explanation ...
On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:
It reads better without explanation ...
Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).
Can anybody find any other example of any IBM engineer ever having a
sense of humour?
Ever?
Anybody?
Scott Lurndal wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers; looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
Thanks, I missed that.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction?
Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
The execution time of each is the same, and the main cost is the
fence synchronizing the Load Store Queue with the cache, flushing the
cache comms queue and waiting for all outstanding cache ops to finish.
On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:
Thanks, I missed that.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction?
The advantage is consuming OpCode space at breathtaking speed.
Oh wait...
Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
Because the memory model was not built with the notion of memory
order, and not all ATOMIC events start or end with a recognizable
instruction. Having ATOMICs announce their beginning and ending
eliminates the need for fencing, even if you keep a <relatively>
relaxed memory order model.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:
Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
Because the memory model was not built with the notion of memory
order, and not all ATOMIC events start or end with a recognizable
instruction. Having ATOMICs announce their beginning and ending
eliminates the need for fencing, even if you keep a <relatively>
relaxed memory order model.
There are fully atomic instructions; the load/store exclusives are
generally there for backward compatibility with armv7. The full set
of atomics (SWP, CAS, atomic arithmetic ops, etc.) arrived with
ARMv8.1.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
The execution time of each is the same, and the main cost is the fence
synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
"Limited ordering regions allow large systems to perform
special Load-Acquire and Store-Release instructions that
provide order between the memory accesses to a region of
the PA map as observed by a limited set of observers."
Scott Lurndal wrote:
"Limited ordering regions allow large systems to perform
special Load-Acquire and Store-Release instructions that
provide order between the memory accesses to a region of
the PA map as observed by a limited set of observers."
Ok, so that explains LoadLOAcquire, StoreLORelease as they are
functionally different: it needs to associate the fence with specific
load and store addresses so it can determine a physical LORegion,
if any, and thereby limit the scope of the fence actions to that LOR.
But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
Why attach a specific kind of fence action to the general LD or ST?
They do the same thing in the atomic instructions, eg:
On 11/11/24 08:59, Scott Lurndal wrote:
There are fully atomic instructions, the load/store exclusives are
generally there for backward compatability with armv7; the full set
of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
ARMv8.1.
They added the atomics for scalability allegedly. ARM never
stated what the actual issue was. I suspect they couldn't
guarantee a memory lock size small enough to eliminate
destructive interference. Like cache line size instead
of word size.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Ok, so that explains LoadLOAcquire, StoreLORelease as they are
functionally different: it needs to associate the fence with specific
load and store addresses so it can determine a physical LORegion,
if any, and thereby limit the scope of the fence actions to that LOR.
But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
Why attach a specific kind of fence action to the general LD or ST?
They do the same thing in the atomic instructions, eg:
Note that the atomics were added in V8.1, and were optional at that
time.
From the ARMv8 ARM:
Arm provides a set of instructions with Acquire semantics for
loads, and Release semantics for stores. These instructions
support the Release Consistency sequentially consistent (RCsc) model.
In addition, FEAT_LRCPC provides Load-AcquirePC instructions. The
combination of Load-AcquirePC and Store-Release can be used to
support the weaker Release Consistency processor consistent (RCpc) model.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:
It reads better without explanation ...
Reminds me of the “EIEIO” instruction from IBM POWER (or was it only
PowerPC).
Can anybody find any other example of any IBM engineer ever having a sense
of humour? Ever?
One of the resource types in JES2, the batch subsystem for z/OS, is
BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
the sysprog. Not too noticeable as humourous but for low-level use
from Assembler some of the macros which manipulate them allow you to
(1) copy one into memory, i.e. "Deliver Or Get" a BERT
(2) define a hook to get control when a BERT is released, i.e
"Do It Later" for a BERT release.
(3) generate a control block for a related data area, i.e. a
"Collector Attribute Table" for BERTs.
These macros are
(1) $DOGBERT
(2) $DILBERT
(3) $CATBERT
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
All this, and much more can be discovered by reading the AMBA
specifications. However, the main point is that the content of the
target address does not have to be transferred to the local cache:
these are remote atomic operations. Quite nice for things like fire-and-forget counters, for example.
The execution time of each is the same, and the main cost is the fence
synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
On 11/11/2024 6:56 AM, jseigh wrote:
On 11/11/24 08:59, Scott Lurndal wrote:
There are fully atomic instructions, the load/store exclusives are
generally there for backward compatibility with armv7; the full set
of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
ARMv8.1.
They added the atomics for scalability allegedly. ARM never
stated what the actual issue was. I suspect they couldn't
guarantee a memory lock size small enough to eliminate
destructive interference. Like cache line size instead
of word size.
For some reason it reminds me of the size of a reservation granule wrt LL/SC.
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
All this, and much more can be discovered by reading the AMBA
specifications. However, the main point is that the content of the
target address does not have to be transferred to the local cache:
these are remote atomic operations. Quite nice for things like fire-and-forget counters, for example.
The execution time of each is the same, and the main cost is the fence
synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Andrew.
On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
store followed by a load to another location to hold. LoadStore is
not strong enough. The SMR algorithm needs that. Iirc, Peterson's
algorithms needs it as well.
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
store followed by a load to another location to hold. LoadStore is
not strong enough. The SMR algorithm needs that. Iirc, Peterson's
algorithms needs it as well.
That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Andrew.
On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
store followed by a load to another location to hold. LoadStore is
not strong enough. The SMR algorithm needs that. Iirc, Peterson's
algorithms needs it as well.
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad.
Ahhh. So, well, it makes me think of the implied StoreLoad in x86/x64 LOCK'ed RMW's...? Does this make any sense to you? Or, am I wandering
around in a damn field somewhere! ;^o
I am so used to SPARC style in RMO mode. The LoadStore should be _after_
any "naked", but atomic logic that acquires and releases a
spinlock...
On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
All this, and much more can be discovered by reading the AMBA
specifications. However, the main point is that the content of the
target address does not have to be transferred to the local cache:
these are remote atomic operations. Quite nice for things like
fire-and-forget counters, for example.
I ended up mostly with a simpler model, IMO:
Normal / RAM-like: Fetch cache line, write back when evicting;
Operations: LoadTile, StoreTile, SwapTile,
LoadPrefetch, StorePrefetch
Volatile (RAM like): Fetch, operate, write-back;
MMIO: Remote Load/Store/Swap request;
Operation is performed on target;
Currently only supports DWORD and QWORD access;
Operations are strictly sequential.
In theory, MMIO access could be added to RAM, but unclear if worth the
added cost and complexity of doing so. Could more easily enforce strict consistency.
The LoadPrefetch and StorePrefetch operations:
LoadPrefetch, try to perform a load from RAM
Always responds immediately
Signals whether it was an L2 hit or L2 Miss.
StorePrefetch
Basically like LoadPrefetch
Signals that the intention is to write to memory.
In my cache and bus design, I sometimes refer to cache lines as "tiles" partly because of how I viewed them as operating, which didn't exactly
match the online descriptions of cache lines.
Say:
Tile:
16 bytes in the current implementation.
Accessed in even and odd rows
A memory access may span an even tile and an odd tile;
The L1 caches need to have a matched pair of tiles for an access.
Cache Line:
Usually described as always 32 bytes;
Descriptions seemed to assume only a single row of lines in caches.
Generally no mention of allowing for an even/odd scheme.
Seemingly, a cache that operated with cache lines would use a single row
of 32-byte cache lines, with misaligned accesses presumably spanning a
pair of adjacent cache lines. To fit with BRAM access patterns, would
likely need to split lines in half, and then mirror the relevant tag
bits (to allow detecting hit/miss).
However, online descriptions generally made no mention of how misaligned accesses were intended to be handled within the limits of a dual-ported
RAM (1R1W).
My L2 cache operates in a way more like that of traditional descriptions
of cache lines, except that they are currently 64 bytes in my L2 cache
(and internally subdivided into four 16-byte parts).
The use of 64 bytes was mostly because this size got the most bandwidth
with my DDR interface (with 16 or 32 byte transfers, more cycles are
spent overhead; however latency was lower).
In this case, the L2<->RAM interface:
512 bit Load Data
512 bit Store Data
Load Address
Store Address
Request Code (IDLE/LOAD/STORE/SWAP)
Request Sequence Number
Response Code (READY/OK/HOLD/FAIL)
Response Sequence Number
Originally, there were no sequence numbers, and IDLE/READY signaling was used between each request (needed to return to this state before
starting a new request). The sequence numbers avoided needing to return
to an IDLE/READY state, allowing the bandwidth over this interface to be nearly doubled.
In a SWAP request, the Load and Store are performed end to end.
General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled, effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each direction for SWAP), which is fairly close to the theoretical limit (internally, the logic for the DDR controller runs at 100MHz, driving IO
as 100MHz SDR, albeit using both posedge and negedge for sampling
responses from the DDR chip, so ~ 200 MHz if seen as SDR).
Theoretically, would be faster to access the chip using the SERDES interface, but:
Hadn't gone up the learning curve for this;
Unclear if I could really effectively utilize the bandwidth with a 50MHz
CPU and my current bus;
Actual bandwidth gains would be smaller, as then CAS and RAS latency
would dominate.
Could in theory have used Vivado MIG, but then I would have needed to
deal with AXI, and never crossed the threshold of wanting to deal with AXI.
Between CPU, L2, and various other devices, I am using a ringbus:
Connections:
128 bits data;
48 bits address (96 bits between L1 caches and TLB);
16 bits: request/response code and flags;
16 bits: source/dest node and request sequence number;
Each node has a set of input and output connections;
Each node may modify a request/response,
or simply forward from input to output.
Messages move along at one position per clock cycle.
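The message format above can be pictured as a struct (field names and widths here are invented for illustration; the actual design is hardware, not C, and packs source/dest/sequence differently):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of one ring message as described above. */
typedef struct {
    uint64_t data_lo, data_hi;   /* 128 bits of data                  */
    uint64_t addr;               /* low 48 bits used                  */
    uint16_t opcode;             /* request/response code and flags   */
    uint8_t  src_node, dst_node; /* routing                           */
    uint8_t  seq;                /* request sequence number           */
} ring_msg;

/* One clock step at one node: either handle a message addressed to
   this node (here: turn it into a response back to the sender), or
   forward it unchanged to the next node on the ring. */
static ring_msg ring_step(ring_msg in, uint8_t my_node) {
    if (in.dst_node == my_node) {
        in.opcode |= 0x8000;     /* mark as a response (invented flag) */
        uint8_t t = in.src_node; /* bounce back to the requester       */
        in.src_node = my_node;
        in.dst_node = t;
    }
    return in;                   /* else: forward as-is */
}
```

The point of the model is the second branch: most nodes most of the time just forward, which is why total ring latency (nodes times one cycle each) dominates performance.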
Generally also 50 MHz at present (*1).
*1: Pretty much everything (apart from some hardware interfaces) runs on
the same clock. Some devices needed faster clocks. Any slower clocks
were generally faked using accumulator dividers (add a fraction every clock-cycle and use the MSB of the accumulator as the virtual clock).
Comparably, the per-node logic cost isn't too high, nor is the logic complexity. However, performance of the ring is very sensitive to ring latency (and there are some amount of hacks to try to reduce the overall latency of the ring in common paths).
At present, the highest resolution video modes that can be managed semi- effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec.
Can do 800x600 or similar in RGBI or color-cell modes (640x400 or
640x480 CC also being an option). Theoretically, there is a 1024x768 monochrome mode, but this is mostly untested. The 4-color and monochrome modes had optional Bayer-pattern sub-modes to mimic full color.
Main modes I have ended up using:
80x25 and 80x50 text/color-cell modes;
Text and color cell graphics exist in the same mode.
320x200 hi-color (RGB555);
640x400 indexed 256 color.
Trying to go much higher than this, and the combination of ringbus
latency and L2 misses turns the display into a broken mess (with a DRAM backed framebuffer). Originally, I had the framebuffer in Block-RAM, but this in turn set the hard-limit based on framebuffer size (and putting framebuffer in DRAM allowing for a bigger L2 cache).
Theoretically, could allow higher resolution modes by adding a fast path between the display output and DDR RAM interface (with access then being multiplexed with the L2 cache). Have not done so.
Or, possible but more radical:
Bolt the VGA output module directly to the L2 cache;
Could theoretically do 800x600 high-color
Would eat around 2/3 of total RAM bandwidth.
Major concern here is that setting resolutions too high would starve the
CPU of the ability to access memory (vs the current situation where
trying to set higher resolutions mostly results in progressively worse display glitches).
Logic would need to be in place so that display can't totally hog the
RAM interface. If doing so, may also make sense to move from color-cell
and block-organized memory to fully raster oriented frame-buffers.
Though, despite being more wonky / non-standard, the block-oriented framebuffer layout has tended to play along better with memory fetch. A raster oriented framebuffer is much more sensitive to timing and access- latency issues compared with 4x4 or 8x8 pixel blocks, with the display working on an internal cache of around 2 .. 4 rows of blocks.
Raster generally needs results to be streamed in-order and at a
consistent latency, whereas blocks can use hit/miss handling, with a hit/miss probe running ahead of the current raster position (and
hopefully able to get the block fetched before it is time to display
it). Though, did add logic to the display to avoid sensing new prefetch requests for a block if it is still waiting for a response on that block (mostly as otherwise the VRAM cache was spamming the ringbus with
excessive prefetch requests). Where, in this case, the VRAM was using exclusively prefetch requests during screen refresh.
In practice, it may not matter that much if the hardware framebuffer is block-ordered rather than raster. The OS's display driver is the only
thing that really needs to care. Main case where it could arguably
"actually matter" being full-screen programs using a DirectDraw style interface, but likely doesn't matter that much if the program is given
an off-screen framebuffer to draw into rather than the actual hardware framebuffer (with the contents being copied over during a "buffer swap" event).
But, as noted, I was mostly using a partly GDI+VfW inspired interface,
which seems "mostly OK". Difference in overhead isn't that large; and
"Draw this here bitmap onto this HDC" offers a certain level of hardware abstraction; as there is no implicit assumption that the pixel format in ones' bitmap object needs to match the format and layout of the display device.
Nevermind if for GUI like operation, programs/windows were mostly
operating in hi-color, with stuff being awkwardly converted to 256 color during window-stack redraw. Granted, pretty sure old-style Windows
didn't work this way, and per-window framebuffers eat a lot of RAM (note that the shell had tabs, but all the tabs share a single window
framebuffer; rather each has a separate character cell buffer, and the
cells are redrawn to the window buffer either when switching tabs or
when more text is printed).
Had considered option for 256-color or 16 color window buffers (to save RAM), but haven't done so yet (for now, if drawing a 16 or 256 color
bitmap, it is internally converted to hi-color). More likely, would
switch to 256 color window buffers if using a 256 color output mode (so conversion to 256 color would be handled when drawing the bitmap to the window, rather than later in the process).
Well, I guess sort of similar wonk that the internal audio mixing is
using 16-bit PCM, whereas the output is A-Law (for the hardware loop buffer). But, a case could be made for doing the OS level audio mixing
as Binary16.
Either way, longer term future of my project is uncertain...
And, unclear if a success or failure.
It mostly did about as well as I could expect.
Never did achieve my original goal of "fast enough to run Quake at
decent framerates", but partly because younger self didn't realize everything would be stuck at 50 MHz (or that 75 or 100 MHz core would
end up needing to be comparably anemic scalar RISC cores; which still
can't really get good framerates in Quake, *).
*: A 100 MHz RV64G core still will not make Quake fast. Extra not helped
if one needs to use smaller L1 caches and generate stalls on memory RAW hazards, ...
Sadly, closest I have gotten thus far involves GLQuake and a hardware rasterizer module. And then the problem that performance is more limited
by trying to get geometry processed and fed into the module, than its ability to walk edges and rasterize stuff. Would seemingly ideally need something that could also perform transforms and deal with perspective- correct texture filtering (vs CPU side transforms, and dynamic
subdivision + affine texture filtering).
...
On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a >>>> STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
store followed by a load to another location to hold. LoadStore is
not strong enough. The SMR algorithm needs that. Iirc, Peterson's
algorithms needs it as well.
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad.
LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Iirc, a sequential membar was strange on SPARC. I have seen things like
this before wrt RMO mode:
membar #StoreLoad | #LoadStore | #LoadLoad | #StoreStore
shit. It's been a while! damn.
The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Andrew.
Humm... It makes me think of, well... does an atomic RMW have implied membars, or are they completely separated akin to the SPARC membar instruction? LOCK'ed RMW on Intel, XCHG instruction aside wrt its
implied LOCK prefix, well, they are StoreLoad! Shit.
To me brilliant is something that still isn't obvious after learning
about it.
On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Does ARM use acquire and release differently than everyone else?
I'm not sure where StoreLoad fits in with those.
aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Isn't this just reusing the normal forwarding network?
If not found, you do as usual and start a regular load operation, but
now you also know that you can skip the flushing of the same?
PS. I do agree that it is a good idea (even patent-worthy?), but not brilliant since it is so very obvious in hindsight.
To me brilliant is something that still isn't obvious after learning
about it.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
To me brilliant is something that still isn't obvious after learning
about it.
Why do you think it's less brilliant to recognize something obvious
that everybody else has overlooked?
- anton
jseigh <jseigh_es00@xemaps.com> wrote:
On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Does ARM use acquire and release differently than everyone else?
I'm not sure where StoreLoad fits in with those.
Yes. LDAR and STLR, used together, are sequentially consistent. This
is a stronger guarantee than acquire and release.
Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
and very clearly defines the memory model.
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory
from E1 to E2
b) E1 and E2 are the same effect
Terje Mathisen <terje.mathisen@tmsw.no> writes:
To me brilliant is something that still isn't obvious after learning
about it.
Why do you think it's less brilliant to recognize something obvious
that everybody else has overlooked?
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Isn't this just reusing the normal forwarding network?
If not found, you do as usual and start a regular load operation, but
now you also know that you can skip the flushing of the same?
Yes. As long as the data in the store buffer doesn't overlap with what
you're about to read, you can skip the flushing.
PS. I do agree that it is a good idea (even patent-worthy?), but not
brilliant since it is so very obvious in hindsight.
LOL! :-)
To me brilliant is something that still isn't obvious after learning
about it.
You have very high standards.
aph@littlepinkcloud.invalid wrote:
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Isn't this just reusing the normal forwarding network?
If not found, you do as usual and start a regular load operation, but
now you also know that you can skip the flushing of the same?
Yes. As long as the data in the store buffer doesn't overlap with what
you're about to read, you can skip the flushing.
PS. I do agree that it is a good idea (even patent-worthy?), but not
brilliant since it is so very obvious in hindsight.
LOL! :-)
To me brilliant is something that still isn't obvious after learning
about it.
You have very high standards.
That is one of the reasons I never started a PhD track, I could never
find an area of study that I thought would be sufficiently ground-breaking.
The other reason is/was that my friend Andy "Crazy" Glew did try the PhD route for several years and hit the same stumbling block vs his
advisors, and I know that Andy is an idea machine well beyond myself.
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory from E1 to E2
b) E1 and E2 are the same effect
I don't understand this. However, here are the actual words:
Pick Basic dependency
A Pick Basic dependency from a read Register effect or read Memory
effect R1 to a Register effect or Memory effect E2 exists if one
of the following applies:
• There is a Dependency through registers and memory from R1 to E2.
• There is an Intrinsic Control dependency from R1 to E2.
• There is a Pick Basic dependency from R1 to an Effect E3 and
there is a Pick Basic dependency from E3 to E2.
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
data dependencies as in stronger than a Dec Alpha, which does not honor
data dependent loads?
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
data dependencies as in stronger than a Dec Alpha, which does not honor
data dependent loads?
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:
It reads better without explanation ...
Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).
Can anybody find any other example of any IBM engineer ever having a sense
of humour? Ever?
Anybody?
aph@littlepinkcloud.invalid writes:[...]
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order
specified by its program."
On 11/14/2024 11:25 PM, Anton Ertl wrote:
aph@littlepinkcloud.invalid writes:
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it[...]
just the Alpha specification? I certainly think that Alpha's lack
of guarantees in memory ordering is a bad idea, and so is ARM's:
"It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in
the order specified by its program."
Well, iirc, the Alpha is the only system that requires an explicit
membar for a RCU based algorithm. Even SPARC in RMO mode does not
need this. Iirc, akin to memory_order_consume in C++:
https://en.cppreference.com/w/cpp/atomic/memory_order
data dependent loads
aph@littlepinkcloud.invalid writes:
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order
specified by its program."
However, I don't think that the Alpha architects considered the Alpha
memory ordering to be an error, and probably still don't, just like
the ARM architects don't consider their memory model to be an error.
I am pretty sure that no Alpha implementation ever made use of the
lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.
@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}
- anton
On Fri, 15 Nov 2024 07:25:12 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
aph@littlepinkcloud.invalid writes:
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack of
guarantees in memory ordering is a bad idea, and so is ARM's: "It's
only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order
specified by its program."
Of course, it's not enough for SC.
What you said holds, for example, for TSO and even by some memory
ordering models that a weaker than TSO.
The points of SC is that in addition to that it requires for any two
stores by different agents to be observed in the same order by all
agents in the system, including those two.
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
On Fri, 15 Nov 2024 03:17:22 -0800
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:
On 11/14/2024 11:25 PM, Anton Ertl wrote:
aph@littlepinkcloud.invalid writes:[...]
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack
of guarantees in memory ordering is a bad idea, and so is ARM's:
"It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in
the order specified by its program."
Well, iirc, the Alpha is the only system that requires an explicit
membar for a RCU based algorithm. Even SPARC in RMO mode does not
need this. Iirc, akin to memory_order_consume in C++:
https://en.cppreference.com/w/cpp/atomic/memory_order
data dependent loads
Your response does not answer Anton's question.
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim?
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
On 11/15/2024 11:27 AM, Anton Ertl wrote:[...]
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
...
On 11/15/2024 12:53 PM, BGB wrote:
On 11/15/2024 11:27 AM, Anton Ertl wrote:[...]
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
...
A TSO from a weak memory model is as it is. It should not necessarily
perform "worse" than other systems that have TSO as a default. The
weaker models give us flexibility. Any weak memory model should be able
to give sequential consistency by using the right membars in the right
places.
On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
On 11/15/2024 12:53 PM, BGB wrote:
On 11/15/2024 11:27 AM, Anton Ertl wrote:[...]
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
...
A TSO from a weak memory model is as it is. It should not necessarily
perform "worse" than other systems that have TSO as a default. The
weaker models give us flexibility. Any weak memory model should be able
to give sequential consistency via using the right membars in the right
places.
The speed difference is mostly that, in a weak model, the L1 cache
merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.
As I understand it, a typical TSO like model will require, say:
Any L1 cache that wants to write to a cache line, needs to explicitly
request write ownership over that cache line;
Any attempt by other cores to access this line,
may require the L2 cache
to send a message to the core currently holding the cache line for
writing to write back its contents, with the request unable to be
handled until after the second core has written back the dirty cache
line.
This would create potential for significantly more latency in cases
where multiple cores touch the same part of memory; albeit the cores
will see each others' memory stores.
So, initially, weak model can be faster due to not needing any
additional handling.
But... Any synchronization points, such as a barrier or locking or
releasing a mutex, will require manually flushing the cache with a weak model.
And, locking/releasing the mutex itself will require a mechanism
that is consistent between cores (such as volatile atomic swaps or
similar, which may still be weak as a volatile-atomic-swap would still
not be atomic from the POV of the L2 cache; and an MMIO interface could
be stronger here).
Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
unlocked on a single core.
Issue then is how to deal with trying to lock a mutex which has thus far
been exclusive to a single core. One would need some way for the core
that last held the mutex to know that it needs to perform an L1 cache
flush.
Though, one possibility could be to leave this part to the OS scheduler/syscall/...
mechanism; so the core that wants to lock the
mutex signals its intention to do so via the OS, and the next time the
core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
mutex as multi-core safe (at which point, the parties will flush L1s at
each mutex lock, though possibly with a timeout count so that, if the
mutex has been single-core for N locks, it reverts to single-core
behavior).
This could reduce the overhead of "frivolous mutex locking" in programs
that are otherwise single-threaded or single processor (leaving the
cache flushes for the ones that are in-fact being used for
synchronization purposes).
....
On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:
On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
On 11/15/2024 12:53 PM, BGB wrote:
On 11/15/2024 11:27 AM, Anton Ertl wrote:[...]
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
...
A TSO from a weak memory model is as it is. It should not necessarily
perform "worse" than other systems that have TSO as a default. The
weaker models give us flexibility. Any weak memory model should be able
to give sequential consistency via using the right membars in the right
places.
The speed difference is mostly that, in a weak model, the L1 cache
merely needs to fetch memory from the L2 or similar, may write to it
whenever, and need not proactively store back results.
As I understand it, a typical TSO like model will require, say:
Any L1 cache that wants to write to a cache line, needs to explicitly
request write ownership over that cache line;
The cache line may have been fetched from a core which modified the
data, and handed this line directly to this requesting core on a
typical read. So, it is possible for the line to show up with
write permission even if the requesting core did not ask for write
permission. So, not all lines being written have to request ownership.
Any attempt by other cores to access this line,
You are being rather loose with your time analysis in this question::
Access this line before write permission has been requested,
or
Access this line after write permission has been requested but
before it has arrived,
or
Access this line after write permission has arrived.
may require the L2 cache
to send a message to the core currently holding the cache line for
writing to write back its contents, with the request unable to be
handled until after the second core has written back the dirty cache
line.
L2 has to know something about how L1 has the line, and likely which
core cache the data is in.
This would create potential for significantly more latency in cases
where multiple cores touch the same part of memory; albeit the cores
will see each others' memory stores.
One can ARGUE that this is a good thing as it makes latency part
of the memory access model. More interfering accesses=higher
latency.
So, initially, weak model can be faster due to not needing any
additional handling.
But... Any synchronization points, such as a barrier or locking or
releasing a mutex, will require manually flushing the cache with a weak
model.
Not necessarily:: My 66000 uses causal memory consistency, yet when
an ATOMIC event begins it reverts to sequential consistency until
the end of the event where it reverts back to causal. Use of MMI/O
space reverts to sequential consistency, while access to config
space reverts all the way back to strongly ordered.
And, locking/releasing the mutex itself will require a mechanism
that is consistent between cores (such as volatile atomic swaps or
similar, which may still be weak as a volatile-atomic-swap would still
not be atomic from the POV of the L2 cache; and an MMIO interface could
be stronger here).
Seems like there could possibly be some way to skip some of the cache
flushing if one could verify that a mutex is only being locked and
unlocked on a single core.
Issue then is how to deal with trying to lock a mutex which has thus far
been exclusive to a single core. One would need some way for the core
that last held the mutex to know that it needs to perform an L1 cache
flush.
This seems to be a job for Cache Consistency.
Though, one possibility could be to leave this part to the OS
scheduler/syscall/...
The OS wants nothing to do with this.
mechanism; so the core that wants to lock the
mutex signals its intention to do so via the OS, and the next time the
core that last held the mutex does a syscall (or tries to lock the mutex
again), the handler sees this, then performs the L1 flush and flags the
mutex as multi-core safe (at which point, the parties will flush L1s at
each mutex lock, though possibly with a timeout count so that, if the
mutex has been single-core for N locks, it reverts to single-core
behavior).
This could reduce the overhead of "frivolous mutex locking" in programs
that are otherwise single-threaded or single processor (leaving the
cache flushes for the ones that are in-fact being used for
synchronization purposes).
....
On 11/15/2024 9:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
For instance, a 100% sequential memory order won't help you with, say,
solving ABA.
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification?
Perhaps one might ask Dr. Kessler?
https://acg.cis.upenn.edu/milom/cis501-Fall09/papers/Alpha21264.pdf
On 11/15/2024 5:24 AM, Michael S wrote:
On Fri, 15 Nov 2024 03:17:22 -0800
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:
On 11/14/2024 11:25 PM, Anton Ertl wrote:
aph@littlepinkcloud.invalid writes:[...]
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack
of guarantees in memory ordering is a bad idea, and so is ARM's:
"It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in
the order specified by its program."
Well, iirc, the Alpha is the only system that requires an explicit
membar for a RCU based algorithm. Even SPARC in RMO mode does not
need this. Iirc, akin to memory_order_consume in C++:
https://en.cppreference.com/w/cpp/atomic/memory_order
data dependent loads
You response does not answer Anton's question.
I guess not. Shit happens. ;^o
On 11/15/2024 12:42 PM, Chris M. Thomasson wrote:
On 11/15/2024 5:24 AM, Michael S wrote:
On Fri, 15 Nov 2024 03:17:22 -0800
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:
On 11/14/2024 11:25 PM, Anton Ertl wrote:
aph@littlepinkcloud.invalid writes:[...]
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack
of guarantees in memory ordering is a bad idea, and so is ARM's:
"It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? >>>>> Sequential consistency can be specified in one sentence: "The result >>>>> of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in
the order specified by its program."
Well, iirc, the Alpha is the only system that requires an explicit
membar for a RCU based algorithm. Even SPARC in RMO mode does not
need this. Iirc, akin to memory_order_consume in C++:
https://en.cppreference.com/w/cpp/atomic/memory_order
data dependent loads
You response does not answer Anton's question.
I guess not. Shit happens. ;^o
Fwiw, in C++ std::memory_order_consume is useful for traversing a node
based stack of something in RCU. In most systems it only acts like a compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
mb for alpha? Cannot remember that one right now.
Hybrid:
Try to have a fast-path in userland ("local core only" mutexes);
Fall back to syscalls if not fast-path.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/15/2024 9:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
For instance, a 100% sequential memory order won't help you with, say,
solving ABA.
Sure, not all problems are solved by sequential consistency, and yes,
it won't solve race conditions like the ABA problem. But jseigh
implied that anyone who finds it easier to write correct and efficient
code for sequential consistency than for a weakly-consistent memory
model (e.g., Alpha's memory model) is incompetent.
Yeah, I have absolutely no issue with ideas that are only obvious in
hindsight; they deserve praise. My real problem is with those things
that are new only because of the environment, as in the idea would be
obvious to anyone "versed in the field".
I.e. US vs Norwegian (European?) patent law.
On 11/15/2024 11:37 PM, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/15/2024 9:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
For instance, a 100% sequential memory order won't help you with, say,
solving ABA.
Sure, not all problems are solved by sequential consistency, and yes,
it won't solve race conditions like the ABA problem. But jseigh
implied that finding it easier to write correct and efficient code for
sequential consistency than for a weakly-consistent memory model
(e.g., Alphas memory model) is incompetent.
What if you had to write code for a weakly ordered system, and the
performance guidelines said to only use a membar when you absolutely
have to. If you say something akin to "I do everything using
std::memory_order_seq_cst", well, that is a violation right off the bat.
Fair enough?
On 11/16/24 16:21, Chris M. Thomasson wrote:
Fwiw, in C++ std::memory_order_consume is useful for traversing a node
based stack of something in RCU. In most systems it only acts like a
compiler barrier. On the Alpha, it must emit a membar instruction.
Iirc, mb for alpha? Cannot remember that one right now.
That got deprecated. Too hard for compilers to deal with. It's now
the same as memory_order_acquire.
Which brings up an interesting point. Even if the hardware memory
model is strongly ordered, compilers can reorder stuff, so you still
have to program as if a weak memory model was in effect. Or maybe
disable reordering or optimization altogether for those target
architectures.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/15/2024 11:37 PM, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/15/2024 9:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
For instance, a 100% sequential memory order won't help you with, say,
solving ABA.
Sure, not all problems are solved by sequential consistency, and yes,
it won't solve race conditions like the ABA problem. But jseigh
implied that finding it easier to write correct and efficient code for
sequential consistency than for a weakly-consistent memory model
(e.g., Alphas memory model) is incompetent.
What if you had to write code for a weakly ordered system, and the
performance guidelines said to only use a membar when you absolutely
have to. If you say something akin to "I do everything using
std::memory_order_seq_cst", well, that is a violation right off the bat.
Fair enough?
Are you trying to support my point?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:I am trying to say you might not be hired if you only knew how to handle >std::memory_order_seq_cst wrt C++... ?
What if you had to write code for a weakly ordered system, and the
performance guidelines said to only use a membar when you absolutely
have to. If you say something akin to "I do everything using
std::memory_order_seq_cst", well, that is a violation right off the bat. ...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
..."Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
What if you had to write code for a weakly ordered system, and the
performance guidelines said to only use a membar when you absolutely
have to. If you say something akin to "I do everything using
std::memory_order_seq_cst", well, that is a violation right off the bat.
I am trying to say you might not be hired if you only knew how to handle
std::memory_order_seq_cst wrt C++... ?
I am not looking to be hired.
In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
your pool of potential hires by including a requirement like the one
above? And then pay for longer development time and additional
hard-to-find bugs coming from overshooting the requirement you stated
above. Or do you limit your software support to TSO hardware (for
lack of widely available SC hardware), and gain all the benefits of
more potential hires, reduced development time, and fewer bugs?
I have compared arguments against strong memory ordering with those
against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:
|[...] human time is consumed in arranging for the introduction of
|suitable scale factors. We only argue that the time consumed is a
|very small percentage of the total time we will spend in preparing an
|interesting problem for our machine. The first advantage of the
|floating point is, we feel, somewhat illusory. In order to have such
|a floating point, one must waste memory capacity which could
|otherwise be used for carrying more digits per word.
Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:
|Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
|seemed to imply that 40-bit arithmetic would hardly ever deliver
|usable accuracy for the solution of so few as 100 linear equations in
|100 unknowns; but by 1954 engineers were solving bigger systems
|routinely and getting satisfactory accuracy from arithmetics with no
|more than 40 bits.
The flaw in the reasoning of the paper was:
|To solve it more easily without floating–point von Neumann had
|transformed equation Bx = c to B^TBx = B^Tc, thus unnecessarily
|doubling the number of sig. bits lost to ill-condition
This is an example of how the supposed gains that the harder-to-use
interface provides (in this case the bits "wasted" on the exponent)
are overcompensated by then having to use a software workaround for
the harder-to-use interface.
On 11/17/2024 11:11 PM, Anton Ertl wrote:
Well, if you used std::memory_order_seq_cst to implement, say, the
memory-barrier logic of a mutex and/or spinlock, that would raise a
red flag in my mind... Not good.
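For what it's worth, a minimal sketch of the alternative: a test-and-set spinlock using only acquire on lock and release on unlock. seq_cst everywhere would also be correct, just stronger (and on many targets costlier) than mutual exclusion requires. Hypothetical code, just to illustrate the point:

```cpp
#include <atomic>

class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // exchange with acquire: critical-section accesses cannot move
        // above a successful lock acquisition
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin; a real implementation would add a pause/yield here
        }
    }
    void unlock() {
        // release: critical-section effects become visible before the
        // lock word is seen as free
        locked.store(false, std::memory_order_release);
    }
};
```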
jseigh <jseigh_es00@xemaps.com> writes:
Even if the hardware
memory model is strongly ordered, compilers can reorder stuff,
so you still have to program as if a weak memory model was in
effect.
That's something between the user of a programming language and the
compiler. If you use a programming language or compiler that gives
weaker memory ordering guarantees than the architecture it compiles
to, that's your choice. Nothing forces compilers to behave that way,
and it's actually easier to write compilers that do not do such
reordering.
Or maybe disable reordering or optimization altogether
for those target architectures.
So you want to throw out the baby with the bathwater.
On 11/17/2024 6:03 AM, jseigh wrote:
On 11/16/24 16:21, Chris M. Thomasson wrote:
Fwiw, in C++ std::memory_order_consume is useful for traversing a
node based stack of something in RCU. In most systems it only acts
like a compiler barrier. On the Alpha, it must emit a membar
instruction. Iirc, mb for alpha? Cannot remember that one right now.
That got deprecated. Too hard for compilers to deal with. It's now
same as memory_order_acquire.
Strange! C++ has:
https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence
That only deals with compilers, not the arch memory order... Humm...
Interesting Joe!
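A small sketch of what atomic_signal_fence buys you, assuming a producer whose consumer is a signal handler on the same thread (the names here are made up):

```cpp
#include <atomic>
#include <csignal>

// Shared with a hypothetical signal handler on the same thread.
volatile sig_atomic_t data_ready = 0;
int payload = 0;

void produce() {
    payload = 42;
    // Stops the compiler from sinking the payload store below the flag
    // store; emits no hardware barrier instruction at all. That is the
    // sense in which it "only deals with compilers".
    std::atomic_signal_fence(std::memory_order_release);
    data_ready = 1;
}
```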
Which brings up an interesting point. Even if the hardware
memory model is strongly ordered, compilers can reorder stuff,
so you still have to program as if a weak memory model was in
effect. Or maybe disable reordering or optimization altogether
for those target architectures.
Indeed. The compiler needs to know about these things. Iirc, there was
an old post over c.p.t about a compiler (think it was GCC) that messed
up a pthread mutex trylock. It's a very old post, but I remember it for
sure.
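With C++ atomics the trylock ordering is at least explicit, so that class of compiler bug is expressible in the source. A minimal sketch (hypothetical, not the pthread code from that old post): acquire on success, relaxed on failure, since a failed attempt publishes nothing.

```cpp
#include <atomic>

class TrySpinlock {
    std::atomic<bool> locked{false};
public:
    bool try_lock() {
        bool expected = false;
        // acquire on success orders the critical section after the CAS;
        // relaxed on failure, because we didn't get the lock
        return locked.compare_exchange_strong(
            expected, true,
            std::memory_order_acquire, std::memory_order_relaxed);
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```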
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory
from E1 to E2
b) E1 and E2 are the same effect
I don't understand this. However, here are the actual words:
Pick Basic dependency
A Pick Basic dependency from a read Register effect or read Memory
effect R1 to a Register effect or Memory effect E2 exists if one
of the following applies:
• There is a Dependency through registers and memory from R1 to E2.
• There is an Intrinsic Control dependency from R1 to E2.
• There is a Pick Basic dependency from R1 to an Effect E3 and
there is a Pick Basic dependency from E3 to E2.
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
Andrew.
In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>, <aph@littlepinkcloud.invalid> wrote:
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory
from E1 to E2
b) E1 and E2 are the same effect
I don't understand this. However, here are the actual words:
Pick Basic dependency
A Pick Basic dependency from a read Register effect or read Memory
effect R1 to a Register effect or Memory effect E2 exists if one
of the following applies:
• There is a Dependency through registers and memory from R1 to E2.
• There is an Intrinsic Control dependency from R1 to E2.
• There is a Pick Basic dependency from R1 to an Effect E3 and
there is a Pick Basic dependency from E3 to E2.
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en
My text for Pick Basic dependency is a quote (where I label the lines
1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.
That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?
I'm pretty sure there are confusing typos all through this section
(E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
was a doozy.
It's likely the wording was better in an earlier document, I've noticed
this section getting more opaque over time.
Kent Dickey <kegs@provalid.com> wrote:
In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
<aph@littlepinkcloud.invalid> wrote:
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
That sort of "summary" was exactly what I was asking for, but I don't see it,
so can you please name the page?
B2-174 in DDI0487J
I'm pretty sure there are confusing typos all through this section
(E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
was a doozy.
It's likely the wording was better in an earlier document, I've noticed
this section getting more opaque over time.
So it seems. I think everything in DDI0487J was meant to be there in
DDI0487K, but it looks like it's all been macro-expanded and some
things fell off the page, because reasons.