• Arm ldaxr / stxr loop question

    From jseigh@jseigh_es00@xemaps.com to comp.arch on Mon Oct 28 15:13:03 2024
    From Newsgroup: comp.arch

    So if we were to implement a spinlock using the above instructions,
    something along the lines of

    .L0
    ldaxr    -- load lockword exclusive w/ acquire membar
    cmp      -- compare to zero
    bne  .L0 -- loop if currently locked
    stxr     -- store 1
    cbnz .L0 -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr, but there's a control dependency from the cbnz branch
    instruction, so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.
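
    For reference, the C11 shape I have in mind is roughly the following
    sketch (names are mine, acquire on the lock path only; on AArch64
    without LSE a compiler would typically lower the CAS to an ldaxr/stxr
    loop much like the one above):

    #include <stdatomic.h>

    typedef struct { atomic_int word; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        int expected = 0;
        /* acquire on success is all the lock path should need */
        while (!atomic_compare_exchange_weak_explicit(
                    &l->word, &expected, 1,
                    memory_order_acquire, memory_order_relaxed)) {
            expected = 0;   /* CAS wrote the observed value here */
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->word, 0, memory_order_release);
    }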

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 31 19:12:43 2024
    From Newsgroup: comp.arch

    On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    My guess is that so few of us understand ARM fence
    mechanics that we cannot address the question asked.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Oct 31 12:39:43 2024
    From Newsgroup: comp.arch

    On 10/28/2024 12:13 PM, jseigh wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
        ldaxr    -- load lockword exclusive w/ acquire membar
        cmp      -- compare to zero
        bne  .LO -- loop if currently locked
            stxr     -- store 1
            cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that.  We could append an additional acquire memory barrier
    but would that be necessary.

    I am not well versed with ARM. On SPARC, locking a spinlock basically goes like:

    atomic logic that locks the spinlock
    MEMBAR #LoadStore | #LoadLoad

    // critical section

    MEMBAR #LoadStore | #StoreStore
    atomic logic that unlocks the spinlock


    Now, this is different from some spinlock logic, e.g. Peterson's
    algorithm, which requires a #StoreLoad in the atomic logic that
    actually locks the spinlock. Basically, it's the same requirement the
    original SMR has: a store followed by a load to a different location
    must hold. RMO aside, even TSO cannot handle that without a membar...
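
    To make the store-then-load point concrete, here is the bare pattern
    in C11; just a sketch of the litmus test, not any particular
    implementation. With both fences present, at least one of the two
    calls must return 1:

    #include <stdatomic.h>

    static atomic_int flag0, flag1;

    /* Thread 0: announce myself, then check the other thread. */
    int thread0_try(void)
    {
        atomic_store_explicit(&flag0, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the #StoreLoad */
        return atomic_load_explicit(&flag1, memory_order_relaxed);
    }

    /* Thread 1: symmetric. Without the fences, both loads could see 0. */
    int thread1_try(void)
    {
        atomic_store_explicit(&flag1, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the #StoreLoad */
        return atomic_load_explicit(&flag0, memory_order_relaxed);
    }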




    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are.  Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 31 20:35:58 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    My guess is that so few of us understand ARM fence
    mechanics that we cannot address the question asked.

    Load-Acquire Exclusive Register derives an address from a base
    register value, loads a 32-bit word or 64-bit doubleword from memory,
    and writes it to a register. The memory access is atomic. The PE marks
    the physical address being accessed as an exclusive access. This exclusive
    access mark is checked by Store Exclusive instructions. See Synchronization
    and semaphores. The instruction also has memory ordering semantics as
    described in Load-Acquire, Load-AcquirePC, and Store-Release. For
    information about memory accesses, see Load/store addressing modes.


    Arm provides a set of instructions with Acquire semantics for loads,
    and Release semantics for stores. These instructions support the
    Release Consistency sequentially consistent (RCsc) model. In addition,
    FEAT_LRCPC provides Load-AcquirePC instructions. The combination of
    Load-AcquirePC and Store-Release can be used to support the weaker
    Release Consistency processor consistent (RCpc) model.

    The full definitions of the Load-Acquire and Load-AcquirePC instructions
    are covered formally in the Definition of the Arm memory model.

    https://developer.arm.com/documentation/102105/latest/
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Fri Nov 1 16:17:49 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 2 12:10:30 2024
    From Newsgroup: comp.arch

    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    Ahhhh! I just learned something about ARM right here. I am so used to
    the acquire membar being placed _after_ the atomic logic that locks the spinlock.

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    So this acts just like a SPARC style:

    atomically_lock_spinlock();
    membar #LoadStore | #LoadLoad

    right?



    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Fri Nov 8 03:17:51 2024
    From Newsgroup: comp.arch

    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 8 14:19:16 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    DDI0487K_a is the most recent.


    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not

    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"
    "2:"
    : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
    :
    : "cc", "memory");

    return result;
    }

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 8 13:40:17 2024
    From Newsgroup: comp.arch

    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
            ldaxr    -- load lockword exclusive w/ acquire membar
            cmp      -- compare to zero
            bne  .LO -- loop if currently locked
             stxr     -- store 1
             cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that.  We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    Ahhhh! I just learned something about ARM right here. I am so used to
    the acquire membar being placed _after_ the atomic logic that locks the spinlock.

    .L0
             ldaxr    -- load lockword exclusive w/ acquire membar
             cmp      -- compare to zero
             bne  .LO -- loop if currently locked
              stxr     -- store 1
              cbnz .LO -- retry if stxr failed

    So this acts just like a SPARC style:

    atomically_lock_spinlock();
    membar #LoadStore | #LoadLoad

    right?

    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guessing that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o





    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are.  Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 8 22:45:51 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM? See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 8 14:56:51 2024
    From Newsgroup: comp.arch

    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM? See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying mostly blind here. I don't really have any
    experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the
    membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 8 15:03:39 2024
    From Newsgroup: comp.arch

    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM?   See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying a mostly blind here. I don't really have any experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 8 23:36:24 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    The two concepts are orthogonal in my experience.

    ARM saw the deficiencies of LL/SC very early in the
    V8 architectural definition, and added a set of
    atomic instructions for scalability to large processor
    counts - one advantage is that the atomic operations
    can be delegated to a cache level or memory, thus potentially
    a very minor power savings in cases where contention is
    common (although such LL/SC try loops often include the ARM
    equivalent of the x86 PAUSE or MWAIT instructions to
    allow power savings during the spin).

    Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
    and very clearly defines the memory model.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Fri Nov 8 19:34:55 2024
    From Newsgroup: comp.arch

    On 11/8/24 17:56, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM?   See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying a mostly blind here. I don't really have any experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of pessimistic CAS RMW type of logic?


    In this case the stxr doesn't need a memory barrier.
    Loads can move forward of it but not forward of the ldaxr
    because it has acquire semantics. For a lock that's ok,
    since the stxr would fail if any other thread acquired
    the lock; the conditional branch would keep the loads
    speculative if the stxr failed, I believe.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Nov 8 21:00:53 2024
    From Newsgroup: comp.arch

    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP (Load Exclusive Pair of registers) and
    STXP (Store Exclusive Pair of registers), which look like they can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
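
    In C terms that pairing would be reached through a 16-byte
    compare-and-swap, roughly like the sketch below. Whether the compiler
    emits an LDXP/STXP loop, a CASP, or a libatomic call depends on the
    target options (e.g. -march=armv8.1-a+lse), so treat it as an
    illustration rather than a guaranteed instruction sequence:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } pair_t;

    /* Double-wide CAS: swap in 'desired' only if *target still equals
     * *expected; on failure, *expected is updated with the current value. */
    bool dwcas(_Atomic pair_t *target, pair_t *expected, pair_t desired)
    {
        return atomic_compare_exchange_strong(target, expected, desired);
    }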


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Nov 9 14:23:47 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 9 13:08:31 2024
    From Newsgroup: comp.arch

    On 11/9/2024 6:23 AM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Ohhhhhh... That's nice! So, we can use both flavors wrt optimistic and pessimistic? Fwiw, I still did not get a chance to read up on the docs
    you so kindly linked me to.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 9 13:47:43 2024
    From Newsgroup: comp.arch

    On 11/8/2024 4:34 PM, jseigh wrote:
    On 11/8/24 17:56, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM?   See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying a mostly blind here. I don't really have any
    experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that
    the membar should be after the final store that actually locks the
    spinlock wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?


    In this case the the stxr doesn't need a memory barrier.

    So, once that stxr completes it already has acquire membar semantics
    from its prior load wrt its acquire? Never mind. I am busy right now on
    some other things.


    Loads can move forward of it but not forward of the ldaxr
    because it has acquire semantics.  For a lock that's ok
    since the stxr would fail if any other thread acquired
    the lock the conditional branch would make the loads
    speculative if the stxr failed I believe.

    Joe Seigh

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 9 14:26:16 2024
    From Newsgroup: comp.arch

    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    DDI0487K_a is the most recent.


    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not

    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...



    "2:"
    : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
    :
    : "cc", "memory");

    return result;
    }


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Nov 9 23:18:14 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:


    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...

    Data Memory Barrier - inner shareable coherency domain
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 01:26:22 2024
    From Newsgroup: comp.arch

    On Sat, 9 Nov 2024 23:18:14 +0000, Scott Lurndal wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:


    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...

    Data Memory Barrior - inner sharable coherency domain

    It reads better without explanation ...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Nov 10 01:37:26 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC?).

    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    Anybody?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 02:44:39 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 1:37:26 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a
    sense of humour?

    We got a control register in the Mc 88100 called FPECR --
    Floating Point Exception Control Register -- past our
    censors (management). We were rather happy about it, too.

    Ed Rupp, who wrote the 68020/30 µcode assembler: due to the
    way we implemented the µROM, we could interchange rows and
    columns to optimize various stuff. We (the engineers)
    got together one night and rearranged the rows and
    columns such that if you looked at the µROM from a good
    distance back, you would see "Moto Man Lives" in bits
    across the ROM. ...
    Actually got in trouble for that one ...

    Ever?

    Anybody?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Nov 10 16:00:23 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what the advantage is for them in having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 23:08:21 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not built with the notion of memory order,
    and not all ATOMIC events start or end with a recognizable
    instruction. Having ATOMICs announce their beginning and ending
    eliminates the need for fencing, even if you keep a <relatively>
    relaxed memory order model.

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    Blame Leslie Lamport for those requirements.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 11 12:41:08 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 16:00:23 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead
    of pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.



    The correct question is not "Why have them?", but "Why not?".
    In an ISA with fixed 32-bit instructions and 32 GPRs, opcode space
    for 2-register operations without an immediate is extremely cheap.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 13:57:55 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 13:59:22 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not build with the notion of memory order
    and that not all ATOMIC events start or end with a recognizable inst-
    ruction. Having ATOMICs announce their beginning and ending eliminates
    the need for fencing; even if you keep a <relatively> relaxed memory
    order model.

    There are fully atomic instructions; the load/store exclusives are
    generally there for backward compatibility with armv7. The full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 11 16:28:48 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 13:59:22 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC
    instead of pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not build with the notion of memory
    order and that not all ATOMIC events start or end with a
    recognizable inst- ruction. Having ATOMICs announce their beginning
    and ending eliminates the need for fencing; even if you keep a
    <relatively> relaxed memory order model.

    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    Also for compatibility with the Cortex-A53, which is still a significant
    part of the installed base.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Mon Nov 11 09:56:44 2024
    From Newsgroup: comp.arch

    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly. ARM never
    stated what the actual issue was. I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference. Like cache line size instead
    of word size.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Nov 11 11:30:46 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    LDADDB, LDADDAB, LDADDALB, LDADDLB
    Atomic add on byte in memory atomically loads an 8-bit byte from memory,
    adds the value held in a register to it, and stores the result back to
    memory. The value initially loaded from memory is returned in the
    destination register.
    - If the destination register is not WZR, LDADDAB and LDADDALB load from
    memory with acquire semantics.
    - LDADDLB and LDADDALB store to memory with release semantics.
    - LDADDB has neither acquire nor release semantics.

    And this goes on and on for all the other atomic ops, SWP, CAS, CLR, EOR,
    SET, SMIN, SMAX, UMIN, UMAX, and data sizes, half, word, dblword, pair.

    What happens if, like Apple, you want a Processor Consistency model too -
    instead of just adding one new fence instruction, do they have to add
    all the atomic instructions (ops * sizes) in again?
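
    From the C11 side the whole family collapses to one operation with a
    memory-order argument; the comments show the lowering I would expect
    from GCC/Clang with LSE enabled, but that mapping is an assumption,
    not something promised here:

    #include <stdatomic.h>

    void add_variants(atomic_int *ctr)
    {
        atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);  /* LDADD   */
        atomic_fetch_add_explicit(ctr, 1, memory_order_acquire);  /* LDADDA  */
        atomic_fetch_add_explicit(ctr, 1, memory_order_release);  /* LDADDL  */
        atomic_fetch_add_explicit(ctr, 1, memory_order_seq_cst);  /* LDADDAL */
    }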



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 17:11:10 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    Note that the atomics were added in V8.1, and were optional at that
    time.

    From the ARMv8 ARM:

    Arm provides a set of instructions with Acquire semantics for
    loads, and Release semantics for stores. These instructions
    support the Release Consistency sequentially consistent (RCsc) model.
    In addition, FEAT_LRCPC provides Load-AcquirePC instructions. The
    combination of Load-AcquirePC and Store-Release can be used to
    support the weaker Release Consistency processor consistent (RCpc) model.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 17:17:56 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly. ARM never
    stated what the actual issue was. I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference. Like cache line size instead
    of word size.

    Speculation is seldom accurate. I would suggest that it
    is more likely that there were requests from ARM customers
    who were looking to build larger SMP systems and it had been
    clear for decades that LL/SC could not scale to larger
    processor counts.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Malcolm Beattie@mbeattie@clueful.co.uk to comp.arch on Mon Nov 11 18:17:54 2024
    From Newsgroup: comp.arch

    On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a sense of humour? Ever?

    One of the resource types in JES2, the batch subsystem for z/OS, is
    BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
    the sysprog. Not too noticeable as humourous but for low-level use
    from Assembler some of the macros which manipulate them allow you to
    (1) copy one into memory, i.e. "Deliver Or Get" a BERT
    (2) define a hook to get control when a BERT is released, i.e
    "Do It Later" for a BERT release.
    (3) generate a control block for a related data area, i.e. a
    "Collector Attribute Table" for BERTs.

    These macros are
    (1) $DOGBERT
    (2) $DILBERT
    (3) $CATBERT

    --Malcolm
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 11 13:53:43 2024
    From Newsgroup: comp.arch

    On 11/11/2024 6:56 AM, jseigh wrote:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions,  the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly.  ARM never
    stated what the actual issue was.  I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference.  Like cache line size instead
    of word size.

    For some reason it reminds me of the size of a reservation granule wrt
    LL/SC.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 11 14:02:22 2024
    From Newsgroup: comp.arch

    On 11/11/2024 9:11 AM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    Note that the atomics were added in V8.1, and were optional at that
    time.

    From the ARMv8 ARM:

    Arm provides a set of instructions with Acquire semantics for
    loads, and Release semantics for stores. These instructions
    support the Release Consistency sequentially consistent (RCsc) model.
    In addition, FEAT_LRCPC provides Load-AcquirePC instructions. The
    combination of Load-AcquirePC and Store-Release can be use to
    support the weaker Release Consistency processor consistent (RCpc) model.

    It sure seems like the "weaker" release is similar to unlocking a
    spinlock with a plain MOV store on x86, because a store there already has implied release membar semantics, aka (#LoadStore | #StoreStore).
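    (As a minimal C11 sketch of that mapping, not from the original post and with illustrative names: a release store paired with an acquire load is the usual flag-passing idiom. Compilers typically lower the acquire load to LDAR, or LDAPR where FEAT_LRCPC is available, on AArch64, and to a plain MOV on x86.)

    #include <stdatomic.h>

    int payload;                    /* ordinary, non-atomic data */
    atomic_int ready;               /* synchronization flag, initially 0 */

    void producer(void)
    {
        payload = 42;
        /* release store: STLR on AArch64, plain MOV on x86 */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* acquire load: LDAR (or LDAPR with FEAT_LRCPC) on AArch64,
           plain MOV on x86 */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return payload;             /* guaranteed to observe 42 */
    }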
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Tue Nov 12 12:14:47 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like
    fire-and-forget counters, for example.

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.
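    (To illustrate, a minimal C11 sketch, not from the original post: in the classic store-buffering pattern each thread does a seq_cst store and then a seq_cst load of the other flag. On AArch64 this typically compiles to STLR followed by LDAR, and it is the LDAR that is not allowed to complete ahead of the earlier STLR, which is where the StoreLoad effect shows up.)

    #include <stdatomic.h>

    atomic_int x, y;    /* both initially 0 */

    int thread0(void)
    {
        atomic_store_explicit(&x, 1, memory_order_seq_cst);    /* STLR x */
        return atomic_load_explicit(&y, memory_order_seq_cst); /* LDAR y */
    }

    int thread1(void)
    {
        atomic_store_explicit(&y, 1, memory_order_seq_cst);    /* STLR y */
        return atomic_load_explicit(&x, memory_order_seq_cst); /* LDAR x */
    }

    /* Sequential consistency forbids both calls returning 0. */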

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 12 13:55:25 2024
    From Newsgroup: comp.arch

    Malcolm Beattie <mbeattie@clueful.co.uk> schrieb:
    On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only
    PowerPC?).

    Can anybody find any other example of any IBM engineer ever having a sense of humour? Ever?

    One of the resource types in JES2, the batch subsystem for z/OS, is
    BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
    the sysprog. Not too noticeable as humorous but for low-level use
    from Assembler some of the macros which manipulate them allow you to
    (1) copy one into memory, i.e. "Deliver Or Get" a BERT
    (2) define a hook to get control when a BERT is released, i.e.
    "Do It Later" for a BERT release.
    (3) generate a control block for a related data area, i.e. a
    "Collector Attribute Table" for BERTs.

    These macros are
    (1) $DOGBERT
    (2) $DILBERT
    (3) $CATBERT

    Do you know if these macros existed before 1993, when Dilbert was
    first released?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 14:10:14 2024
    From Newsgroup: comp.arch

    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like fire-and-forget counters, for example.

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs a StoreLoad barrier when an algorithm depends on a store followed by a load from another location holding in that order. LoadStore is not strong enough. The SMR algorithm needs that; iirc, Peterson's algorithm needs
    it as well.
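    (For example, a minimal C11 sketch of Peterson's entry protocol, not from the original post: the store to one's own flag has to be ordered before the load of the peer's flag, which is exactly a StoreLoad requirement. With seq_cst accesses a compiler typically emits an MFENCE or a locked RMW for the flag store on x86.)

    #include <stdatomic.h>

    atomic_int flag[2];     /* both initially 0 */
    atomic_int turn;

    void peterson_lock(int self)
    {
        int other = 1 - self;
        atomic_store_explicit(&flag[self], 1, memory_order_seq_cst);
        atomic_store_explicit(&turn, other, memory_order_seq_cst);
        /* StoreLoad: the loads below must not be satisfied before the
           stores above are globally visible. */
        while (atomic_load_explicit(&flag[other], memory_order_seq_cst) &&
               atomic_load_explicit(&turn, memory_order_seq_cst) == other)
            ;   /* spin */
    }

    void peterson_unlock(int self)
    {
        atomic_store_explicit(&flag[self], 0, memory_order_release);
    }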
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 14:25:48 2024
    From Newsgroup: comp.arch

    On 11/11/2024 1:53 PM, Chris M. Thomasson wrote:
    On 11/11/2024 6:56 AM, jseigh wrote:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions,  the load/store exclusives are
    generally there for backward compatibility with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly.  ARM never
    stated what the actual issue was.  I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference.  Like cache line size instead
    of word size.

    For some reason it reminds me of the size of a reservation granule wrt LL/SC.

    For some reason I remember, way back, having to pad and align things
    to reservation-granule boundaries on PPC. Iirc, it was the "anchor"
    structure. The nodes were aligned and padded up to L2 cache lines. This
    was 20+ years ago! damn it. Time goes by. Uggg. ;^o
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 12 16:55:40 2024
    From Newsgroup: comp.arch

    On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like fire-and-forget counters, for example.


    I ended up mostly with a simpler model, IMO:
      Normal / RAM-like: Fetch cache line, write back when evicting;
        Operations: LoadTile, StoreTile, SwapTile,
          LoadPrefetch, StorePrefetch
      Volatile (RAM like): Fetch, operate, write-back;
      MMIO: Remote Load/Store/Swap request;
        Operation is performed on target;
        Currently only supports DWORD and QWORD access;
        Operations are strictly sequential.

    In theory, MMIO access could be added to RAM, but unclear if worth the
    added cost and complexity of doing so. Could more easily enforce strict consistency.

    The LoadPrefetch and StorePrefetch operations:
      LoadPrefetch, try to perform a load from RAM
        Always responds immediately
        Signals whether it was an L2 hit or L2 Miss.
      StorePrefetch
        Basically like LoadPrefetch
        Signals that the intention is to write to memory.


    In my cache and bus design, I sometimes refer to cache lines as "tiles"
    partly because of how I viewed them as operating, which didn't exactly
    match the online descriptions of cache lines.

    Say:
      Tile:
        16 bytes in the current implementation.
        Accessed in even and odd rows
          A memory access may span an even tile and an odd tile;
          The L1 caches need to have a matched pair of tiles for an access.
      Cache Line:
        Usually described as always 32 bytes;
        Descriptions seemed to assume only a single row of lines in caches.
          Generally no mention of allowing for an even/odd scheme.

    Seemingly, a cache that operated with cache lines would use a single row
    of 32-byte cache lines, with misaligned accesses presumably spanning a
    pair of adjacent cache lines. To fit with BRAM access patterns, would
    likely need to split lines in half, and then mirror the relevant tag
    bits (to allow detecting hit/miss).

    However, online descriptions generally made no mention of how misaligned accesses were intended to be handled within the limits of a dual-ported
    RAM (1R1W).


    My L2 cache operates in a way more like that of traditional descriptions
    of cache lines, except that they are currently 64 bytes in my L2 cache
    (and internally subdivided into four 16-byte parts).

    The use of 64 bytes was mostly because this size got the most bandwidth
    with my DDR interface (with 16 or 32 byte transfers, more cycles are
    spent overhead; however latency was lower).

    In this case, the L2<->RAM interface:
      512 bit Load Data
      512 bit Store Data
      Load Address
      Store Address
      Request Code (IDLE/LOAD/STORE/SWAP)
      Request Sequence Number
      Response Code (READY/OK/HOLD/FAIL)
      Response Sequence Number

    Originally, there were no sequence numbers, and IDLE/READY signaling was
    used between each request (needed to return to this state before
    starting a new request). The sequence numbers avoided needing to return
    to an IDLE/READY state, allowing the bandwidth over this interface to be nearly doubled.

    In a SWAP request, the Load and Store are performed end to end.
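
    (Purely as a software-side model of that interface, a hedged C sketch; the real thing is a set of hardware signals, and any field widths beyond those stated above are guesses.)

    #include <stdint.h>

    enum req_code  { REQ_IDLE, REQ_LOAD, REQ_STORE, REQ_SWAP };
    enum resp_code { RESP_READY, RESP_OK, RESP_HOLD, RESP_FAIL };

    /* One L2<->RAM transaction, 64-byte (512-bit) data each way. */
    struct l2_ram_xact {
        uint8_t        load_data[64];   /* 512-bit load data (response side) */
        uint8_t        store_data[64];  /* 512-bit store data (request side) */
        uint64_t       load_addr;
        uint64_t       store_addr;
        enum req_code  req;             /* IDLE / LOAD / STORE / SWAP        */
        uint16_t       req_seq;         /* request sequence number           */
        enum resp_code resp;            /* READY / OK / HOLD / FAIL          */
        uint16_t       resp_seq;        /* response sequence number          */
    };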

    General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled, effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each direction for SWAP), which is fairly close to the theoretical limit (internally, the logic for the DDR controller runs at 100MHz, driving IO
    as 100MHz SDR, albeit using both posedge and negedge for sampling
    responses from the DDR chip, so ~ 200 MHz if seen as SDR).

    Theoretically, would be faster to access the chip using the SERDES
    interface, but:
    Hadn't gone up the learning curve for this;
    Unclear if I could really effectively utilize the bandwidth with a 50MHz
    CPU and my current bus;
    Actual bandwidth gains would be smaller, as then CAS and RAS latency
    would dominate.

    Could in theory have used Vivado MIG, but then I would have needed to
    deal with AXI, and never crossed the threshold of wanting to deal with AXI.


    Between CPU, L2, and various other devices, I am using a ringbus:
      Connections:
        128 bits data;
        48 bits address (96 bits between L1 caches and TLB);
        16 bits: request/response code and flags;
        16 bits: source/dest node and request sequence number;
      Each node has a set of input and output connections;
        Each node may modify a request/response,
          or simply forward from input to output.
        Messages move along at one position per clock cycle.
          Generally also 50 MHz at present (*1).

    *1: Pretty much everything (apart from some hardware interfaces) runs on
    the same clock. Some devices needed faster clocks. Any slower clocks
    were generally faked using accumulator dividers (add a fraction every clock-cycle and use the MSB of the accumulator as the virtual clock).
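
    (A minimal C model of that accumulator-divider trick, not from the original post; the divide ratio below is just an example.)

    #include <stdint.h>

    static uint32_t clk_acc;        /* phase accumulator */

    /* Called once per 50 MHz master cycle; returns the virtual clock level.
       step = ratio * 2^32, e.g. 0x40000000 (1/4) gives a ~12.5 MHz clock. */
    static int virtual_clock(uint32_t step)
    {
        clk_acc += step;
        return clk_acc >> 31;       /* MSB of the accumulator */
    }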


    Comparably, the per-node logic cost isn't too high, nor is the logic complexity. However, performance of the ring is very sensitive to ring
    latency (and there are some amount of hacks to try to reduce the overall latency of the ring in common paths).


    At present, the highest resolution video modes that can be managed semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec.

    Can do 800x600 or similar in RGBI or color-cell modes (640x400 or
    640x480 CC also being an option). Theoretically, there is a 1024x768 monochrome mode, but this is mostly untested. The 4-color and monochrome
    modes had optional Bayer-pattern sub-modes to mimic full color.

    Main modes I have ended up using:
      80x25 and 80x50 text/color-cell modes;
        Text and color cell graphics exist in the same mode.
      320x200 hi-color (RGB555);
      640x400 indexed 256 color.


    Trying to go much higher than this, and the combination of ringbus
    latency and L2 misses turns the display into a broken mess (with a DRAM
    backed framebuffer). Originally, I had the framebuffer in Block-RAM, but
    this in turn set the hard-limit based on framebuffer size (and putting framebuffer in DRAM allowing for a bigger L2 cache).

    Theoretically, could allow higher resolution modes by adding a fast path between the display output and DDR RAM interface (with access then being multiplexed with the L2 cache). Have not done so.

    Or, possible but more radical:
      Bolt the VGA output module directly to the L2 cache;
      Could theoretically do 800x600 high-color
        Would eat around 2/3 of total RAM bandwidth.

    Major concern here is that setting resolutions too high would starve the
    CPU of the ability to access memory (vs the current situation where
    trying to set higher resolutions mostly results in progressively worse
    display glitches).

    Logic would need to be in place so that display can't totally hog the
    RAM interface. If doing so, may also make sense to move from color-cell
    and block-organized memory to fully raster oriented frame-buffers.

    Though, despite being more wonky / non-standard, the block-oriented framebuffer layout has tended to play along better with memory fetch. A
    raster oriented framebuffer is much more sensitive to timing and access-latency issues compared with 4x4 or 8x8 pixel blocks, with the
    display working on an internal cache of around 2 .. 4 rows of blocks.

    Raster generally needs results to be streamed in-order and at a
    consistent latency, whereas blocks can use hit/miss handling, with a
    hit/miss probe running ahead of the current raster position (and
    hopefully able to get the block fetched before it is time to display
    it). Though, did add logic to the display to avoid sending new prefetch requests for a block if it is still waiting for a response on that block (mostly as otherwise the VRAM cache was spamming the ringbus with
    excessive prefetch requests). Where, in this case, the VRAM was using exclusively prefetch requests during screen refresh.

    In practice, it may not matter that much if the hardware framebuffer is block-ordered rather than raster. The OS's display driver is the only
    thing that really needs to care. Main case where it could arguably
    "actually matter" being full-screen programs using a DirectDraw style interface, but likely doesn't matter that much if the program is given
    an off-screen framebuffer to draw into rather than the actual hardware framebuffer (with the contents being copied over during a "buffer swap" event).

    But, as noted, I was mostly using a partly GDI+VfW inspired interface,
    which seems "mostly OK". Difference in overhead isn't that large; and
    "Draw this here bitmap onto this HDC" offers a certain level of hardware abstraction; as there is no implicit assumption that the pixel format in
    one's bitmap object needs to match the format and layout of the display device.


    Nevermind if for GUI like operation, programs/windows were mostly
    operating in hi-color, with stuff being awkwardly converted to 256 color during window-stack redraw. Granted, pretty sure old-style Windows
    didn't work this way, and per-window framebuffers eat a lot of RAM (note
    that the shell had tabs, but all the tabs share a single window
    framebuffer; rather each has a separate character cell buffer, and the
    cells are redrawn to the window buffer either when switching tabs or
    when more text is printed).

    Had considered option for 256-color or 16 color window buffers (to save
    RAM), but haven't done so yet (for now, if drawing a 16 or 256 color
    bitmap, it is internally converted to hi-color). More likely, would
    switch to 256 color window buffers if using a 256 color output mode (so conversion to 256 color would be handled when drawing the bitmap to the window, rather than later in the process).


    Well, I guess sort of similar wonk that the internal audio mixing is
    using 16-bit PCM, whereas the output is A-Law (for the hardware loop
    buffer). But, a case could be made for doing the OS level audio mixing
    as Binary16.


    Either way, longer term future of my project is uncertain...

    And, unclear if a success or failure.

    It mostly did about as well as I could expect.

    Never did achieve my original goal of "fast enough to run Quake at
    decent framerates", but partly because younger self didn't realize
    everything would be stuck at 50 MHz (or that 75 or 100 MHz core would
    end up needing to be comparably anemic scalar RISC cores; which still
    can't really get good framerates in Quake, *).

    *: A 100 MHz RV64G core still will not make Quake fast. Extra not helped
    if one needs to use smaller L1 caches and generate stalls on memory RAW hazards, ...


    Sadly, closest I have gotten thus far involves GLQuake and a hardware rasterizer module. And then the problem that performance is more limited
    by trying to get geometry processed and fed into the module, than its
    ability to walk edges and rasterize stuff. Would seemingly ideally need something that could also perform transforms and deal with
    perspective-correct texture filtering (vs CPU side transforms, and
    dynamic subdivision + affine texture filtering).

    ...


    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Tue Nov 12 23:02:13 2024
    From Newsgroup: comp.arch

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Tue Nov 12 18:55:42 2024
    From Newsgroup: comp.arch

    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Andrew.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Nov 13 00:29:42 2024
    From Newsgroup: comp.arch

    On Tue, 12 Nov 2024 23:02:13 +0000, aph@littlepinkcloud.invalid wrote:

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    1990-1992: I was working on Mc88120. It had a conditional cache--a
    place to store store-data until the store instruction became consistent.
    After becoming consistent, the store data would migrate to L1 or on
    to DRAM, ... This structure could be probed for memory order rather
    similar to what ARM is doing.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 16:32:00 2024
    From Newsgroup: comp.arch

    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad.

    Ahhh. So, well, it makes me think of the implied StoreLoad in x86/x64
    LOCK'ed RMW's...? Does this make any sense to you? Or, am I wandering
    around in a damn field somewhere! ;^o

    I am so used to SPARC style in RMO mode. The LoadStore should be _after_
    any "naked", but atomic logic that acquires and releases a spinlock...
    ;^o acquire with regard to the memory barrier logic, -not- the atomic
    logic that gains the lock itself....


    LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 17:43:33 2024
    From Newsgroup: comp.arch

    On 11/12/2024 4:32 PM, Chris M. Thomasson wrote:
    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad.

    Ahhh. So, well, it makes me think of the implied StoreLoad in x86/x64 LOCK'ed RMW's...? Does this make any sense to you? Or, am I wandering
    around in a damn field somewhere! ;^o

    I am so used to SPARC style in RMO mode. The LoadStore should be _after_
    any "naked", but atomic logic that acquires and releases a
    spinlock...

    I need to clarify here. Shit. The acquire barrier (#LoadStore | #LoadLoad) should come after the atomic logic that acquires the spinlock. The release barrier (#LoadStore | #StoreStore) should come before the atomic logic that
    releases the spinlock. Humm... How much more complicated can I make it?
    Sorry.


    __________
    naked_atomic_lock()
    membar #LoadStore | #LoadLoad

    // locked region

    membar #LoadStore | #StoreStore
    naked_atomic_unlock();
    __________
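
    (In C11 fence terms, a rough equivalent of the above, assuming the "naked" lock/unlock are bare relaxed atomic operations; names here are illustrative.)

    #include <stdatomic.h>

    static atomic_int lockword;     /* 0 = free, 1 = held */

    static void lock(void)
    {
        /* bare atomic acquire of the lock word, no ordering by itself */
        while (atomic_exchange_explicit(&lockword, 1, memory_order_relaxed))
            ;
        atomic_thread_fence(memory_order_acquire);  /* #LoadStore | #LoadLoad   */
    }

    static void unlock(void)
    {
        atomic_thread_fence(memory_order_release);  /* #LoadStore | #StoreStore */
        /* bare store that releases the lock word */
        atomic_store_explicit(&lockword, 0, memory_order_relaxed);
    }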

    Damn it. Sorry everybody!

    [...]
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 17:50:11 2024
    From Newsgroup: comp.arch

    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Iirc, a sequential membar was strange on SPARC. I have seen things like
    this before wrt RMO mode:

    membar #StoreLoad | #LoadStore | #LoadLoad | #StoreStore

    shit. It's been a while! damn.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 17:53:58 2024
    From Newsgroup: comp.arch

    On 11/12/2024 2:55 PM, BGB wrote:
    On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like
    fire-and-forget counters, for example.


    [...]
    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.


    Humm... It makes me think of, well... does an atomic RMW have implied
    membars, or are they completely separate, akin to the SPARC membar instruction? LOCK'ed RMW on Intel (XCHG aside, wrt its
    implied LOCK prefix), well, they are StoreLoad! Shit.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 13 07:37:46 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    To me, brilliant is something that still isn't obvious after learning
    about it.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 13 01:23:51 2024
    From Newsgroup: comp.arch

    On 11/12/2024 5:50 PM, Chris M. Thomasson wrote:
    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad.

    Very interesting... Need to ponder on this. Still, running it with no
    memory barrier at all via an RCU-based algorithm has to be faster. Humm...
    Membar-free reads are very nice.



    LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Iirc, a sequential membar was strange on SPARC. I have seen things like
    this before wrt RMO mode:

    membar #StoreLoad | #LoadStore | #LoadLoad | #StoreStore

    shit. It's been a while! damn.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 13 04:25:15 2024
    From Newsgroup: comp.arch



    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.


    Humm... It makes me think of, well... does an atomic RMW have implied membars, or are they completely separated akin to the SPARC membar instruction? LOCK'ed RMW on Intel, XCHG instruction aside wrt its
    implied LOCK prefix, well, they are StoreLoad! Shit.

    I am not sure how atomic memory ops are implemented through AMBA / AXI.
    I think AMBA / AXI is a very good bus to use. It turns out I have been
    using a similar proprietary bus (FTA) for my project. I have been
    working on an AXI bus bridge so that I can migrate my system to AXI.
    In FTA bus there is a command field on the bus that allows atomic memory
    ops to be specified. There does not seem to be an equivalent in AXI.
    Unless perhaps the user tag field is used.
    One thing about the AXI bus is I do not understand how the CAS
    instruction is supported. In my bus CAS is supported with double data on
    the bus. There are two data items that need to be supplied to the memory controller for CAS.
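
    (For reference, a hedged C sketch of what the controller has to do with those two data items; names are illustrative, and the real operation is performed atomically at the memory controller.)

    #include <stdbool.h>
    #include <stdint.h>

    /* A CAS command carries a compare value and a swap value. */
    bool cas_at_controller(uint64_t *addr, uint64_t compare, uint64_t swap)
    {
        if (*addr != compare)
            return false;     /* mismatch: memory left unchanged */
        *addr = swap;         /* match: write the swap value     */
        return true;
    }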
    Another issue I ran into was that FTA uses the response bus to send MSI
    interrupts. I am thinking of using the AXI read response bus for this
    purpose, by sending an ERR response for interrupts with the read data containing the interrupt info. But I do not know if AXI devices will get confused seeing a read response without any read address previously
    supplied. I am assuming devices will be able to filter bus transactions
    using transaction ids.

    I have not been able to get the Q+ CPU to operate reliably in the FPGA,
    so I am stuck without a system CPU. I have given some thought to just
    using an FPGA with built in (ARM) CPU cores. I want to get working on peripheral cores.

    I have been using the MIG controller in native mode (non-AXI) coupled
    with a multi-port memory controller for access to DDR3 RAM. The MIG
    controller can supply a lot of bandwidth, but it has some latency to it.
    I measured it at 28 clocks IIRC. I think timing depends on the memory component too. But that is at the MIG controller frequency. In my case
    200 MHz. At the CPU frequency it is much less. While there is latency, a
    new request can be made almost every memory clock. To get a lot of
    bandwidth requestors like the frame buffer request an entire scan-line
    of data with back-to-back accesses to the MIG controller. The frame
    buffer uses a burst of 50 accesses, so it takes around 80 memory clock
    cycles. In terms of a 50 MHz CPU that would be only about 20 clocks. The
    frame buffer uses about 10% of the memory bandwidth and supports
    800x600x16bpp mode. I think it could support 1920x1080x16bpp video. But
    I did not want to spend that much bandwidth on video.
    The biggest issue I found with the MIG controller was specifying the
    right memory component.
    It is 900 MHz memory but seems to run okay at 800 MHz. 800 MHz was more conveniently matched to other clocks in the system.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 13 10:20:12 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    To me brilliant is something that still isn't obvious after larning
    about it.

    Why do you think it's less brilliant to recognize something obvious
    that everybody else has overlooked?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Wed Nov 13 18:07:17 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Yes. LDAR and STLR, used together, are sequentially consistent. This
    is a stronger guarantee than acquire and release.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Wed Nov 13 18:13:04 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to read, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not brilliant since it is so very obvious in hindsight.

    LOL! :-)


    To me brilliant is something that still isn't obvious after larning
    about it.

    You have very high standards.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 13 13:40:32 2024
    From Newsgroup: comp.arch

    On 11/13/2024 4:20 AM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    To me brilliant is something that still isn't obvious after larning
    about it.

    Why do you think it's less brilliant to recognize something obvious
    that everybody else has overlooked?


    Yeah.

    This is my feelings about some of the deficiencies of standard RISC-V.

    The stuff I want added, and have added as experiments, is not exactly non-obvious. Saw at least one person doubting that it would make much difference (namely in the use of "make the immediate bigger" prefixes).


    But, experimentally, it does make enough of a difference that it should
    be worth considering, at least for performance-oriented use-cases
    (likely, not really needed for microcontrollers, where the priority is
    more "cheap CPU" rather than "fast CPU").


    But, as I see it, if you can make binaries 40% smaller, and 35% faster,
    this is something that should be worth considering.

    As opposed to the C extension which IME seems to only give around a
    25-30% size reduction, and (with a CPU design that only does superscalar
    on properly aligned 32-bit instructions) actually makes performance
    slightly worse.

    Granted, having both jumbo prefixes and the 'C' extension being likely a
    best case for code density (though, BGBCC doesn't yet support the 'C' extension, so I can't test this).

    I am half tempted to move the RV jumbo prefixes from
    ...-100-kkkkk-00-11011 (ALUIW block)
    To:
    ...-100-kkkkk-00-00111 (JALR block)

    For "technical reasons" (well, would also clean up the encoding conflict
    with an older/dropped "ADDIWU" instruction). TBD if worth the break in compatibility though (if I did so, might consider also claiming 1xx for
    jumbo prefixes, say, to give an extra bit so that "JIMM+JIMM+LUI" could
    have enough bits to encode F0..F31 as well, but there are other
    possibilities for how to encode this).



    Most of these features have historical precedent as well, so should in
    theory be "safe" (similar sorts of prefixes existed in Transputer and
    Java VM).

    Granted, not found examples thus far in 1980s or 1990s RISC
    architectures (these sorts of prefixes didn't really seem to start
    appearing in RISC's until the early 2000s). Annoyingly, most precedent
    for the use of prefixes and prefix instructions seems to be in terms of
    CISC architectures.

    The closest direct equivalent of the Jumbo_Imm prefix I am aware of
    didn't appear until MicroBlaze, which is cutting it a little close (and
    have yet to verify if it existed in the original version of MicroBlaze).
    In any case, will probably be safer in a few years (as MicroBlaze
    moves further outside of the 20 year window).


    Register-Indexed Load/Store and similar were fairly widespread (80386,
    ARM32, and others), so should be safe.


    Can note that also, in BJX2, the general ideas behind WEX encoding also
    had precedent (was in use in 1990s DSP architectures and similar), ...


    Sometimes, there is an elegance in finding things sufficiently obvious
    that it is more a question why it is not more widespread.

    Or, avoiding things that require a non-trivial leap in logic, or pose difficulty in verifying the logic chains.


    Though, arguably, in terms of precedent, something like RISC-V is
    arguably fairly safe:
    Its core ISA lacks anything that didn't already have precedent by the
    early 1980s.


    But, as I see it, pretty much anything that has precedent earlier than
    ~2004 should be safe (which, as I see it, should include things like
    jumbo prefixes, etc).

    ...


    There are, granted, potential gotchas, like the years of hassle that
    S3TC and depth-fail shadows and similar caused.

    Where, S3TC should have been invalid, as it wasn't substantially
    different from what was already in common use in the 1980s.

    Seemingly, main arguable "novel" feature it had was defining the
    interpolated colors as 1/3 + 2/3 rather than 1-bit (A or B), or 3/8 +
    5/8 (as in some earlier Apple image formats).

    There was the "S2TC" workaround (just disallow interpolation entirely); theoretically though, someone could have just used DXT1/DXT5 mostly as
    is, but then redefined the interpolation as 3/8 + 5/8 as "close enough"...


    Similarly, the depth-fail issue was also annoying. There was still
    depth-pass, but this had some annoying edge cases that required workarounds (the shadows would break if the camera was inside a shadow
    volume).

    Depth-fail shadows should also be safe now.

    ...


    Well, and people can freely use FAT32, or (in theory) NTFS. Though, the
    design of NTFS itself is a bigger impediment to using it, albeit with
    some limits (newer features may still not be safe).

    A person should also be able to do their own off-brand implementations
    of x86-64 (*) and 32-bit ARM and Thumb/Thumb2.

    *: The original form of x86-64 should be safe, would mostly need to omit
    newer forms of SSE, and AVX, to be safe.

    ...


    May not be obvious, but admittedly, I am more someone that tries to
    avoid "novelty" (often things like cost/benefit concerns and historical precedent are given more weight).


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 13 13:08:38 2024
    From Newsgroup: comp.arch

    On 11/13/2024 10:07 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Yes. LDAR and STLR, used together, are sequentially consistent. This
    is a stronger guarantee than acquire and release.

    interesting!
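
    For what it's worth, a minimal C++ sketch of the mapping being
    discussed (assuming current GCC/Clang code generation for AArch64:
    a seq_cst load becomes LDAR, a seq_cst store becomes STLR, with no
    separate DMB between them):

        #include <atomic>

        std::atomic<int> data{0};
        std::atomic<int> flag{0};

        void publisher() {
            data.store(42, std::memory_order_seq_cst);  // expected: STLR
            flag.store(1, std::memory_order_seq_cst);   // expected: STLR
        }

        int consumer() {
            while (flag.load(std::memory_order_seq_cst) == 0)  // expected: LDAR
                ;
            return data.load(std::memory_order_seq_cst);       // expected: LDAR
        }

    The point above is that the LDAR can check whether its address hits
    anything still sitting in the local store buffer; if it doesn't, it
    need not wait for earlier STLRs to drain.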
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Nov 14 06:24:32 2024
    From Newsgroup: comp.arch

    In article <YfxXO.384093$EEm7.56154@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
    and very clearly defines the memory model.

    Your definition of "clearly" differs from mine.

    Look at Pick dependencies on page B2-239 and B2-240:
    (I'm replacing complicating details with "blah blah" or "A, B, C", to
    highlight the issue I want to point out)

    ---
    Pick Basic dependency:
    There is A, B, C, or a Pick dependency between E1 and E2
    Pick Data dependency:
    There is a Pick Basic dependency from E1 to E2 and blah blah.
    Pick Address dependency:
    There is a Pick Data dependency from E1 to E3 and E2 is blah blah
    Pick Control dependency:
    This is a Pick Basic dependency from E1 to E3 and E2 is blah blah
    Pick Dependency:
    There is a Pick Basic, Pick Address, Pick Data, or Pick Control
    dependency from E1 to E2
    ---

    This is completely circular, and never defines what "pick" is.

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect
    ---

    Using the words as they are written, if any of 1a, 1b, 2a, or 2b is
    true, a Pick Basic dependency exists between E1 and E2. To give
    background, E1 and E2 are any events (effects) and not necessarily in
    program order (E2 could be before E1, and can be on another CPU, the
    event numbering system is not defined to indicate program order and when
    they want to say E1 is in program order before E2 it seems to always
    explicitly say so, and there are LOTS of other places where they create
    a third event, E3, which may be between E1 and E2). I'm using event interchangeably with effect since I think effect is a terrible term.

    So by rule 1a by itself, a Pick Basic Dependency exists between a Load
    instruction (an example of an Explicit Memory Read, I'm assuming; as
    best as I can tell, an Explicit Memory Read is not really defined) and
    every other possible event in that system happening before or after
    that load.

    So what does this mean? I literally have no idea what they are trying to
    get at here.

    If E1 and E2 are the "same effect", does that mean it's the same instruction/operation, or just the same type of operation (like two loads),
    or what? If there were an "overview" summarizing ordering in English,
    then I could interpret the looseness better.

    I want to make it clear that I don't want a formal grammar, I just think
    this is a particularly poor way to try to present this information.

    What it reads like to me is one of those bad logic puzzles, but with
    info missing: The cookie was eaten by someone wearing a red coat. Susan
    wears a hat. Who ate the cookie?

    Kent
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Thu Nov 14 09:23:23 2024
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Nov 14 10:36:11 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    To me brilliant is something that still isn't obvious after learning
    about it.

    Why do you think it's less brilliant to recognize something obvious
    that everybody else has overlooked?

    I did not convey my intended meaning here, what I meant is that there
    are levels of brilliance, even when being the first to recognize something.

    Yeah, I have absolutely no issue with ideas that are only obvious in
    hindsight; they deserve praise. My real problem is with those things
    that are new, but only because of the environment, as in the idea would
    be obvious to anyone "versed in the field".

    I.e US vs Norwegian (European?) patent law.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Nov 14 10:41:14 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to read, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    LOL! :-)


    To me brilliant is something that still isn't obvious after learning
    about it.

    You have very high standards.

    That is one of the reasons I never started a PhD track, I could never
    find an area of study that I thought would be sufficiently ground-breaking.

    The other reason is/was that my friend Andy "Crazy" Glew did try the PhD
    route for several years and hit the same stumbling block vs his
    advisors, and I know that Andy is an idea machine well beyond myself.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 14:13:54 2024
    From Newsgroup: comp.arch

    On 11/14/2024 1:41 AM, Terje Mathisen wrote:
    aph@littlepinkcloud.invalid wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to read, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    LOL!  :-)


    To me brilliant is something that still isn't obvious after learning
    about it.

    You have very high standards.

    That is one of the reasons I never started a PhD track, I could never
    find an area of study that I thought would be sufficiently ground-breaking.

    The other reason is/was that my friend Andy "Crazy" Glew did try the PhD route for several years and hit the same stumbling block vs his
    advisors, and I know that Andy is an idea machine well beyond myself.

    I had the chance to converse with him (Andy) as well. Wonderful!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 14:21:32 2024
    From Newsgroup: comp.arch

    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Thu Nov 14 23:20:02 2024
    From Newsgroup: comp.arch

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 19:25:15 2024
    From Newsgroup: comp.arch

    On 11/14/2024 3:20 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Ahhhh! Thank you. Btw, agreed.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 19:50:55 2024
    From Newsgroup: comp.arch

    On 11/9/2024 5:37 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Oh yeah! LOL! Thanks.


    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    Anybody?

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 15 07:25:12 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added
    causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An
    advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 03:17:22 2024
    From Newsgroup: comp.arch

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for an RCU-based algorithm. Even SPARC in RMO mode does not need
    this. Iirc, it's akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads
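
    A minimal sketch of that reader-side pattern (names invented for
    illustration; in practice current compilers promote
    memory_order_consume to acquire):

        #include <atomic>

        struct Node { int payload; };

        std::atomic<Node*> shared_node{nullptr};

        void writer(Node* n) {
            n->payload = 42;
            shared_node.store(n, std::memory_order_release);  // publish
        }

        int reader() {
            Node* p = shared_node.load(std::memory_order_consume);
            if (!p)
                return -1;
            // The dereference is data-dependent on the pointer load. On
            // most architectures that dependency alone orders the two
            // loads; on Alpha an additional read barrier between them
            // would be needed.
            return p->payload;
        }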





    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Nov 15 15:22:07 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 07:25:12 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."


    Of course, it's not enough for SC.
    What you said holds, for example, for TSO and even for some memory
    ordering models that are weaker than TSO.
    The point of SC is that, in addition to that, it requires any two
    stores by different agents to be observed in the same order by all
    agents in the system, including those two.
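
    Either way, the usual way to make the gap between SC and anything
    weaker concrete is a litmus test. A minimal sketch in C++ of the
    classic store-buffering case:

        #include <atomic>
        #include <thread>

        std::atomic<int> x{0}, y{0};
        int r1, r2;

        void t0() { x.store(1, std::memory_order_seq_cst);
                    r1 = y.load(std::memory_order_seq_cst); }
        void t1() { y.store(1, std::memory_order_seq_cst);
                    r2 = x.load(std::memory_order_seq_cst); }

        int main() {
            std::thread a(t0), b(t1);
            a.join(); b.join();
            // With seq_cst (i.e. SC), r1 == 0 && r2 == 0 is forbidden:
            // one of the stores must come first in the single global
            // order. Demote the accesses to acquire/release or relaxed
            // and the 0/0 outcome becomes allowed, and is readily seen
            // on TSO hardware, where each thread's load can complete
            // before its own earlier store has drained from the store
            // buffer.
            return (r1 == 0 && r2 == 0) ? 1 : 0;
        }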

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Nov 15 15:24:59 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 15 14:13:27 2024
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of
    guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added
    causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An
    advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak
    consistency.

    Perhaps one might ask Dr. Kessler?

    https://acg.cis.upenn.edu/milom/cis501-Fall09/papers/Alpha21264.pdf

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Fri Nov 15 11:08:29 2024
    From Newsgroup: comp.arch

    On 11/15/2024 2:25 AM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton

    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place. Strongly
    consistent memory won't help incompetence.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 15 17:19:34 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 15 Nov 2024 07:25:12 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of
    guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."


    Of course, it's not enough for SC.
    What you said holds, for example, for TSO and even by some memory
    ordering models that a weaker than TSO.
    The points of SC is that in addition to that it requires for any two
    stores by different agents to be observed in the same order by all
    agents in the system, including those two.

    That's included in the statement I cited: stores are operations, and
    the behaviour is the same as executing all the operations in some
    sequential order. I.e., all processors observe everything they
    observe with the same results. I have this definition from <https://en.wikipedia.org/wiki/Sequential_consistency>, which cites
    the following source for it: Leslie Lamport, "How to Make a
    Multiprocessor Computer That Correctly Executes Multiprocess
    Programs", IEEE Trans. Comput. C-28,9 (Sept. 1979), 690-691.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 15 17:27:37 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 12:42:15 2024
    From Newsgroup: comp.arch

    On 11/15/2024 5:24 AM, Michael S wrote:
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    You response does not answer Anton's question.


    I guess not. Shit happens. ;^o
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 12:48:46 2024
    From Newsgroup: comp.arch

    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Well, if one can't handle the memory barriers, say wrt
    std::memory_order_* in C++, well, that is a problem wrt creating these
    "exotic" types of algorithms. Imvho, that is.


    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.
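
    A minimal sketch of the classic case (not anyone's production code):
    a naive lock-free stack where pop() can be broken by ABA even though
    every access is seq_cst, because the CAS only checks that the head
    pointer has the same value, not that the node hasn't been popped,
    freed, and pushed back in the meantime.

        #include <atomic>

        struct Node { Node* next; int value; };

        std::atomic<Node*> head{nullptr};

        void push(Node* n) {
            n->next = head.load(std::memory_order_seq_cst);
            // On failure, n->next is refreshed with the current head.
            while (!head.compare_exchange_weak(n->next, n,
                                               std::memory_order_seq_cst))
                ;
        }

        Node* pop() {
            Node* old = head.load(std::memory_order_seq_cst);
            while (old) {
                Node* next = old->next;  // (1) read successor of old head
                // Between (1) and the CAS below, another thread may pop
                // 'old', pop and free its successor, then push 'old' back.
                // The CAS still succeeds (head again equals 'old'), but
                // 'next' is stale: that is ABA. Sequential consistency
                // does not prevent it; tags/version counters, hazard
                // pointers, or deferred reclamation are needed instead.
                if (head.compare_exchange_weak(old, next,
                                               std::memory_order_seq_cst))
                    return old;
            }
            return nullptr;
        }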
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 15 14:53:00 2024
    From Newsgroup: comp.arch

    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
    The tradeoff is more about implementation cost, performance, etc.

    Weak model:
    Cheaper (and simpler) to implement;
    Performs better when there is no need to synchronize memory;
    Performs worse when there is need to synchronize memory;
    ...



    However, local to the CPU core:
    Not respecting things like RAW hazards does not seem well advised.


    Like, if we store to a location, and then immediately read back from it,
    one can expect to see the most recently written value, not the previous
    value. Or, if one stores to two adjacent memory locations, one expects
    that both stores write the data correctly.

    Granted, it is a tradeoff:
    Not bothering: Fast, Cheap, but may break expected behavior;
    Could naively use NOPs if aliasing is possible, but this is bad.
    Add an interlock check, stall the pipeline if it happens:
    Works, but can add a noticeable performance penalty;
    My attempts at 75 and 100 MHz cores had often done this;
    Sadly, memory RAW and WAW hazards are not exactly rare.
    Use internal forwarding, so written data is used directly next cycle.
    Better performance;
    But, has a fairly high cost for the FPGA (*1).



    *1: This factor (along with L1 cache sizes) weighs in heavily to why I continue to use 50MHz. Otherwise, I could use 75 MHz, but this internal forwarding logic, and L1 caches with 32K of BRAM (excluding metadata)
    and 1-cycle access, are not really viable at 75 MHz.

    For the L2 cache, which is much bigger, one can use a few extra
    pad-cycles to access the Block-RAM array. Though, 5-cycle latency for
    Load/Store operations would not be good.

    Can note that with Block-RAM, usual behavior seems to be that if one
    tries to read from one port while writing to another port on the same
    clock edge, if both are at the same location, the prior contents will be returned. This may be a general behavior in Verilog though, rather than
    a Block-RAM thing (also seems to apply to LUTRAM if accessed in the same pattern; though LUTRAM allows also reading the value via combinatorial
    logic rather than a clock-edge, which seems to always return the value
    from the most recent clock-edge).


    As I can note, a 4K or 8K L1 cache with stall on RAW or WAW, at 75 MHz,
    tends to perform worse IME, than a 32K cache running at 50 MHz with no
    RAW/WAW stall.

    Also, trying to increase MHz by increasing instruction latency in many
    cases was also not ideal for performance.


    Granted, if I were to do things the "DEC Alpha" way, I probably could
    run stuff at 75MHz, but then would likely need the compiler to insert a
    bunch of strategic NOPs so that the program doesn't break.


    For memory ordering, possibly, in my case a case could be made for an
    "order respecting DRAM cache" via the MMIO interface, say:
    F000_01000000..F000_3FFFFFFF

    Could be defined to alias with the main RAM map, but with strictly
    sequential ordering for every memory access across all cores (at the
    expense of performance).

    Where:
    0000_00000000..7FFF_FFFFFFFF: Virtual Address Space
    8000_00000000..BFFF_FFFFFFFF: Supervisor-Only Virtual Address Space
    C000_00000000..CFFF_FFFFFFFF: Physical Address Space, Default Caching
    D000_00000000..DFFF_FFFFFFFF: Physical Address Space, Volatile/NoCache
    E000_00000000..EFFF_FFFFFFFF: Reserved
    F000_00000000..FFFF_FFFFFFFF: MMIO Space

    MMIO space is currently fully independent of RAM space.

    However, at present:
    FFFF_F0000000..FFFF_FFFFFFFF: MMIO Space, as Used for MMIO devices.

    So, in theory, remerging RAM IO space into MMIO Space would be possible
    (well, except that trying to access HW MMIO address ranges via RAM-space access would likely be disallowed).


    Can note, MMU disabled:
    0000_00000000..0FFF_FFFFFFFF: Same as C000..CFFF space.
    1000_00000000..7FFF_FFFFFFFF: Invalid

    ...

    Granted, current scheme does set a limit of 16TB of RAM.
    But, biggest FPGA boards I have only have 256MB, so, ...

    And, current VA map within TestKern (from memory):
    0000_00000000..0000_00FFFFFF: NULL Space
    0000_01000000..0000_3FFFFFFF: RAM Range (Identity Mapped)
    0000_40000000..0000_BFFFFFFF: Direct Page Mapping (no swap)
    0001_00000000..3FFF_FFFFFFFF: Mapped to swapfile, Global
    4000_00000000..7FFF_FFFFFFFF: Process Local


    Note that, within the RAM-range, the RAM will wrap around. The specifics
    of the wraparound are used to detect RAM size (this would set an
    effective limit at 512MB, after which no wraparound would be detected).

    Specifics here would need to change if larger RAM sizes were supported.

    Not sure how RAM size is detected with DIMM modules. IIRC, with PCs, it
    was more a matter of probing along linearly until one finds an address
    that no longer returns valid data (say, if one hits the 1GB mark, and
    gets back 000000 or FFFFFFF or similar, assume end of RAM at 1GB).


    One does need to make sure caches (including the L2 cache) are flushed
    during all this, as the caches, doing their usual cache thing, may make
    the probe incorrectly detect more RAM than actually exists.
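
    A minimal sketch of that kind of wraparound probe (assuming the
    window is uncached or the caches were flushed first; the constants
    are made up):

        #include <cstdint>
        #include <cstddef>

        // Write a distinct marker at each power-of-two offset and watch
        // offset 0: the first offset whose write aliases back onto
        // offset 0 is the wraparound point, i.e. the installed RAM size.
        size_t detect_ram_size(volatile uint32_t* base, size_t max_bytes) {
            const uint32_t marker = 0x55AA0000u;
            base[0] = marker;
            for (size_t sz = 1u << 20; sz < max_bytes; sz <<= 1) {
                base[sz / sizeof(uint32_t)] = marker | 1u;
                if (base[0] != marker)
                    return sz;       // write at 'sz' wrapped onto offset 0
            }
            return max_bytes;        // no wraparound seen below the limit
        }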


    ...


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 14:05:57 2024
    From Newsgroup: comp.arch

    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
      The tradeoff is more about implementation cost, performance, etc.

    Weak model:
      Cheaper (and simpler) to implement;
      Performs better when there is no need to synchronize memory;
      Performs worse when there is need to synchronize memory;
      ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right places.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 15 17:35:22 2024
    From Newsgroup: comp.arch

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it
    whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;
    Any attempt by other cores to access this line, may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache line.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak
    model. And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache
    flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache flush.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/... mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Nov 16 00:51:36 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right
    places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write
    permission. So, not all lines being written have to request owner-
    ship.

    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.

    may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.

    And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/...

    The OS wants nothing to do with this.

    mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 16 00:39:42 2024
    From Newsgroup: comp.arch

    On 11/15/2024 6:51 PM, MitchAlsup1 wrote:
    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models,
    shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
    The tradeoff is more about implementation cost, performance, etc.
    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right
    places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it
    whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write permission. So, not all lines being written have to request owner-
    ship.


    OK.

    I think the bigger distinction, is more that a concept of write
    ownership exists in the first place...


    In my current memory model, there is no concept of write ownership.

    Ironically, this also means the RISC-V LR/SC instructions don't make
    sense in my memory model, but this hasn't been a huge loss (they just
    sort of behave as-if they worked).


    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.


    Yeah. I didn't really distinguish these cases...

    May possibly be different in a cache system where events are processed sequentially, rather than circling around in a ring bus (and processed
    in whatever way the requests happen to hit the L2 cache or similar).


    Say, request comes in for address 123 from core B:
    Write ownership held by A?
    Send request to A to Flush 123;
    Flag 123 as the flush having been requested;
    To avoid repeating the request.
    Ignore B's request for now (it then circles the bus);
    Write ownership not held?
    If the request was for write privilege:
    Mark as held by B;
    Send response to B's request.

    If A receives a flush request:
    Flush the cache line in question;
    Write modified data, or send a FLUSH_ACK response or similar.

    When L2 receives response:
    Write data back to L2 if needed;
    Mark cache line as no longer held.

    Less obvious what happens if an L2 miss happens and the line at that
    location is still held.
    Would presumably need all cores to flush any dirty lines before they
    could be safely evicted from the L2 cache (in my current design, this
    scenario is ignored).
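
    As a sanity check on the bookkeeping, a rough sketch of that
    request/flush flow (structure and names invented for illustration,
    not taken from the actual bus logic):

        // Per-line state the L2 would need to track.
        struct L2Line {
            int  owner         = -1;     // core holding write ownership, -1 if none
            bool flush_pending = false;  // flush already requested, don't repeat it
        };

        enum class Action { Grant, Defer };

        // Request from core 'req' for this line; 'wants_write' if it asked
        // for write ownership. Defer means the request keeps circling the
        // ring bus and is retried later.
        Action l2_handle_request(L2Line& ln, int req, bool wants_write) {
            if (ln.owner != -1 && ln.owner != req) {
                if (!ln.flush_pending) {
                    // send_flush_request(ln.owner);  // ask the holder to write back
                    ln.flush_pending = true;
                }
                return Action::Defer;
            }
            if (wants_write)
                ln.owner = req;          // grant write ownership
            return Action::Grant;
        }

        // Dirty data or a FLUSH_ACK arrives back at the L2.
        void l2_handle_flush_response(L2Line& ln /*, data if dirty */) {
            // write_back_to_l2_if_dirty(...);
            ln.owner         = -1;       // line no longer held
            ln.flush_pending = false;
        }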



                                                   may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.


    Yeah.

    More bookkeeping needed here...


    Possibly though, L2 may not need to track the specific core, if it can
    send out a general message:
    "Whoever holds line 123 needs to flush it."

    Message then has a special behavior in that it circles the whole bus
    without taking any shortcut paths, and is then removed once it gets back around to the L2 cache (after presumably every other node on the bus has
    seen it), and/or gets replaced by the appropriate ACK (if it hits an L1
    cache that is holding the line in question).

    Specifics likely to differ here between a message-ring bus, and other
    types of bus.

    Possibly the comparably high latency of a message ring would not be
    ideal in this case.



    One other possibility for a bus could be be a star-network, where
    message can either be point-to-point or broadcast. Say, point-to-point
    being used if both locations have a known address, and broadcast
    messages sent to every node on the bus.


    Unclear if "hubs" on this bus would either need to know which "ports" correspond to which node address ranges, or simply broadcast any
    incoming message on all ports. Broadcast with no buffering would be cheapest/simplest, but would have overhead, and a potential for
    "collision" (where two nodes send a message at the same time, but don't
    yet see the other's message).

    Likely, each hub would need a FIFO and basic message routing, but this
    would add cost (per-node cost is likely to be higher than that of
    forwarding messages along a ring).


    But, there could be merit, say, if messages could get anywhere on the
    bus within a relatively small number of clock cycles.


    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    OK.



    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak
    model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.


    In my case, RAM-like and MMIO accesses use different messaging protocols...

    Not currently any scheme in place to support consistency modeling for
    RAM like access.

    MMIO is ordered mostly as the L1 cache will not let anything more happen
    until it gets a response (so, the L1 cache forces sequential operation
    on its end). On the other end, the bridge to the MMIO bus will become
    "busy" and not respond to any more requests until the currently active
    request has been completed (so, it is a serialized "first come, first
    serve" as far as message arrival on the ringbus).

    Atomic operations on the bus could likely be formed as a special form of
    MMIO SWAP request (with a few bits somewhere used to encode which
    operator to perform). Well, unless the only supported atomic operator is
    SWAP.

    Likely it would depend on the target device for whether or not atomic operators are allowed.





           And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache
    flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.


    Possibly so...


    Though, one possibility could be to leave this part to the OS
    scheduler/syscall/...

    The OS wants nothing to do with this.


    Unclear how to best deal with it...

    Status quo:
    Lock/release using system calls;
    System calls always perform L1 flush
    ( ... if there were more than 1 core ... ).

    Faster:
    Lock/Release handled purely in userland;
    Delay or avoid cache flushes.

    Hybrid:
    Try to have a fast-path in userland ("local core only" mutexes);
    Fall back to syscalls if not fast-path (see the sketch after this list).

    Lazy hybrid:
    Lock/release continue using system calls;
    Nothing changes as far as userland cares.
    Try to delay the L1 flushes.
    Say, to save the ~20k clock-cycles this process eats.
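
    A minimal sketch of the Hybrid option, in C++; the sys_mutex_wait /
    sys_mutex_wake calls are placeholders for whatever the kernel would
    provide, and any cache-flush work is assumed to happen on the slow path:

        #include <atomic>

        // Hypothetical kernel entry points for the slow path.
        void sys_mutex_wait(std::atomic<int> *addr, int val);
        void sys_mutex_wake(std::atomic<int> *addr);

        struct HybridMutex {
            std::atomic<int> state{0};   // 0 = free, 1 = locked, 2 = contended

            void lock() {
                int expected = 0;
                // Userland fast path: uncontended case, no syscall.
                if (state.compare_exchange_strong(expected, 1,
                                                  std::memory_order_acquire))
                    return;
                // Slow path: mark contended and sleep in the kernel, which
                // is also where any L1 flushing would be done.
                while (state.exchange(2, std::memory_order_acquire) != 0)
                    sys_mutex_wait(&state, 2);
            }

            void unlock() {
                // Wake a waiter only if the slow path was ever taken.
                if (state.exchange(0, std::memory_order_release) == 2)
                    sys_mutex_wake(&state);
            }
        };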


    Lazy flushing on syscalls and scheduler events seems possible, as
    (assuming the core isn't frozen) this will happen eventually.

    It does mean a scenario can occur (where a previously assumed local-only
    mutex is in fact non-local) that could take an unreasonably long time to
    deal with (one core needing to wait until the other core does a system call
    or similar).


    Note that if a mutex lock happens, and can't be handled immediately,
    general behavior is to mark the task as waiting on a mutex and then
    switch to a different task (this is otherwise similar to how calls like "usleep()" are handled; the task can be resumed once the mutex is no longer held).



    Though, for now, TestKern is still purely single-processor.

    But, not much motivation to invest in multicore TestKern when I can
    still generally only fit a single core on the XC7A100T.

    Can at least go dual core on the XC7A200T, but hadn't really made any
    use of it (so the second core sits around mostly doing nothing in this
    case).



    Where, in the single core case, no real way to handle mutexes other than
    to reschedule the task.

    So, some of this is still kinda theoretical.

    Admittedly, it wasn't until fairly recently that TestKern got preemptive
    task scheduling. And, even then, there were still a lot of
    race-condition type bugs early on (well, partly stemming from the
    general lack of mutexes in many cases; they are pretty much entirely absent
    in the kernel because, as-is, there is no way to actually resolve a
    mutex conflict in the kernel should one occur...).

    Well, and for userland, I ended up with generally using "reschedule on syscalls" rather than "reschedule on timer IRQ", as "reschedule on
    syscalls" was slightly less prone to result in the sorts of race
    conditions that caused stuff to break (only uses timer IRQ as a fallback
    if the task has managed to hold the CPU for an unreasonable amount of time).


    But, yeah, a lot is still "in theory" for now, actual state of TestKern
    still kinda sucks on this front...


    One option is to leave this to the OS scheduler/syscall
    mechanism; the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex
    again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).
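
    A very rough sketch of that promotion idea (all names, fields, and kernel
    hooks here are hypothetical; the real thing would live partly in the
    kernel):

        #include <atomic>

        void flush_local_l1();   // hypothetical kernel L1 writeback/invalidate

        struct LazyMutex {
            std::atomic<int>  owner_core{-1};        // core that last held it
            std::atomic<bool> promote_pending{false};
            std::atomic<bool> multicore{false};      // once set, flush on lock
        };

        // Another core wants a mutex that has so far been core-local:
        // record the request via the OS and reschedule until it is handled.
        void request_promotion(LazyMutex &m) {
            m.promote_pending.store(true);
        }

        // Run when the previous owner next enters a syscall (or re-locks):
        void promote_if_requested(LazyMutex &m) {
            if (m.promote_pending.exchange(false)) {
                flush_local_l1();          // make prior stores visible
                m.multicore.store(true);   // from now on, flush at each lock
            }
        }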


    The cost of mutex locking could almost be ignored...

    Until of course people are trying to use otherwise frivolous mutex locks
    to protect things that are only ever accessed by a single thread (as has
    sort of become the style in many codebases), etc.


    Or, say, burning extra clock-cycles in the name of "malloc()" being thread-safe (even if, much of the time, the mutex hiding inside the malloc/free calls or similar isn't actually protecting anything).


    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 16 07:37:44 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say, solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for sequential consistency than for a weakly-consistent memory model
    (e.g., Alpha's memory model) is incompetent.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 16 07:46:17 2024
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    The tradeoff is more about implementation cost, performance, etc.

    Yes. And the "etc." includes "ease of programming".

    Weak model:
    Cheaper (and simpler) to implement;

    Yes.

    Performs better when there is no need to synchronize memory;

    Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm
    ahead in that case and achieve the same performance. It just has to
    keep checkpoints around in case that there is a need to synchronize
    memory.

    Performs worse when there is need to synchronize memory;

    With a cheap multiprocessor implementation, yes. In general, no: Any sequentially consistent implementation is also an implementation of
    every weaker memory model, and the memory barriers become nops in that
    kind of implementation. Ok, nops still have a cost, but it's very
    close to 0 on a modern CPU.

    Another potential performance disadvantage of sequential consistency
    even with a sophisticated implementation:

    If you have some algorithm that actually works correctly even when it
    gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency coming from
    making the load non-stale while that latency will not occur or be less
    in a similarly-sophisticated implementation of an appropriate weak
    consistency model.

    However, given that the access to actually-shared memory is slow even
    on weakly-consistent hardware, software usually takes measures to
    avoid having a lot of such accesses, so that cost will usually be
    miniscule.


    What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

    * For correctness, the safe way is to insert a memory barrier between
    any two memory operations.

    * For performance (on cheap implementations of weak memory models) you
    want to execute as few memory barriers as possible.

    * You cannot use testing to find out whether you have enough (and the
    right) memory barriers. That's not only because the involved
    threads may not be in the right state during testing for uncovering
    the incorrectness, but also because the hardware used for testing
    may actually have stronger consistency than the memory model, and so
    some kinds of bugs will never show up in testing on that hardware,
    even when the threads reach the right state. And testing is still
    the go-to solution for software people to find errors (nowadays even
    glorified by continuous integration and modern fuzz testing
    approaches).

    The result is that a lot of software dealing with shared memory is
    incorrect because it does not have a memory barrier that it should
    have, or inefficient on cheap hardware with expensive memory barriers
    because it uses more memory barriers than necessary for the memory
    model. A program may even be incorrect in one place and have
    superfluous memory barriers in another one.
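
    A standard illustration of the testing problem (a sketch, not from the
    thread): the missing release/acquire pair below is invisible when testing
    on TSO hardware, but can break on a weaker model.

        #include <atomic>

        int payload;                        // plain data being handed over
        std::atomic<bool> ready{false};

        void producer() {
            payload = 42;
            // Bug on weak models: nothing keeps the payload store ahead of
            // the flag store.  Tests on x86/TSO will typically pass anyway.
            ready.store(true, std::memory_order_relaxed);
            // Correct: std::memory_order_release
        }

        void consumer() {
            while (!ready.load(std::memory_order_relaxed)) { }
            // Correct: std::memory_order_acquire on the load above
            int v = payload;                // may read stale data on ARM/POWER
            (void)v;
        }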

    Or programmers just don't do this stuff at all (as advocated by
    jseigh), and instead just write sequential programs, or use bottled
    solutions that often are a lot more expensive than superfluous memory
    barriers. E.g., in Gforth the primary inter-thread communication
    mechanism is currently implemented with pipes, involving the system
    calls read() and write(). And Bernd Paysan who implemented that is a
    really good programmer; I am sure he would be able to wrap his head
    around the whole memory model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend
    on more productive things.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 16 08:58:40 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?
    ...
    Perhaps one might ask Dr. Kessler?

    https://acg.cis.upenn.edu/milom/cis501-Fall09/papers/Alpha21264.pdf

    I don't think that anything in the 21264 core would result in the
    Alpha-unique inconsistency; the only core mechanisms that I can think
    of where that would be relevant is value prediction, and the 21264
    does not do that.

    Looking at the memory subsystems of bigger Alpha systems might be more relevant.

    There is a good reason to suspect that the Alpha architects imagined
    hardware that actually did not appear: They did not specify hardware
    byte and 16-bit memory accesses with the justification that a
    first-level write-back cache would require ECC in DEC machines, and
    ECC for bytes (or read-modify-write for keeping ECC on larger units)
    is supposedly too expensive. However, the Alphas without BWX
    instructions (everything up to EV5, but EV56 and later acquired BWX)
    never had a first-level write-back cache.

    And the EV6 which has a first-level write-back cache, implements the
    BWX instructions, so the reasoning against BWX obviously does not hold
    water. Reading on page 31 of the paper above, the 21264 (EV6) uses read-modify-write for updating the ECC data.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 13:21:48 2024
    From Newsgroup: comp.arch

    On 11/15/2024 12:42 PM, Chris M. Thomasson wrote:
    On 11/15/2024 5:24 AM, Michael S wrote:
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?  I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>.  Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.


    I guess not. Shit happens. ;^o

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.
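
    For reference, a sketch of the kind of consume-ordered traversal being
    described (C++); on most machines the loads below compile to plain loads,
    while Alpha would need its mb:

        #include <atomic>

        struct Node {
            int value;
            std::atomic<Node*> next{nullptr};
        };

        std::atomic<Node*> head{nullptr};

        // Read-side traversal: each pointer load carries a data dependency
        // into the dereference, so no full acquire barrier is required.
        int sum_read_side() {
            int sum = 0;
            for (Node *n = head.load(std::memory_order_consume); n != nullptr;
                 n = n->next.load(std::memory_order_consume))
                sum += n->value;
            return sum;
        }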
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 13:23:41 2024
    From Newsgroup: comp.arch

    On 11/16/2024 1:21 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:42 PM, Chris M. Thomasson wrote:
    On 11/15/2024 5:24 AM, Michael S wrote:
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?  I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>.  Seriously? >>>>> Sequential consistency can be specified in one sentence: "The result >>>>> of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.


    I guess not. Shit happens. ;^o

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    I think, iirc, there is a way to use an acquire membar on the loading of
    the initial node of a collection to iterate it without using memory_order_consume for every node. I might be wrong on that. It's been
    a while!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 13:28:06 2024
    From Newsgroup: comp.arch

    On 11/15/2024 10:39 PM, BGB wrote:
    [...]
    Hybrid:
      Try to have a fast-path in userland ("local core only" mutexes);
      Fall back to syscalls if not fast-path.

    Are you familiar with adaptive mutex logic? I know a lot about
    fast-paths in userland before they need to wait in the kernel on a
    contended mutex, or empty condition for a lock-free stack or something...
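
    For what it's worth, the usual shape of an adaptive mutex is something
    like this (a sketch; sys_wait/sys_wake stand in for a futex-like
    facility):

        #include <atomic>

        void sys_wait(std::atomic<int> *addr, int val);   // hypothetical
        void sys_wake(std::atomic<int> *addr);            // hypothetical

        struct AdaptiveMutex {
            std::atomic<int> state{0};     // 0 = free, 1 = locked

            void lock() {
                // Adaptive part: spin briefly in userland first, on the
                // assumption the holder is running and will release soon.
                for (int i = 0; i < 100; ++i) {
                    int expected = 0;
                    if (state.compare_exchange_weak(expected, 1,
                                                    std::memory_order_acquire))
                        return;
                }
                // Still contended: block in the kernel instead of burning CPU.
                int expected = 0;
                while (!state.compare_exchange_weak(expected, 1,
                                                    std::memory_order_acquire)) {
                    sys_wait(&state, 1);
                    expected = 0;
                }
            }

            void unlock() {
                state.store(0, std::memory_order_release);
                sys_wake(&state);   // naive: always wake; real ones track waiters
            }
        };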


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sat Nov 16 21:59:01 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 07:25:12 GMT, Anton Ertl wrote:

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo}, title =
    {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    ...

    Available online at Bitsavers <http://bitsavers.trailing-edge.com/pdf/dec/tech_reports/WRL-95-7.pdf>
    and mirrors.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 14:10:15 2024
    From Newsgroup: comp.arch

    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for sequential consistency than for a weakly-consistent memory model
    (e.g., Alphas memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 16 22:28:12 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Yeah, I have absolutely no issue with ideas that are only obvious in hindsight, they deserve praise. My real problem are those things that
    are new, but only because of the environment, as in the idea would be obvious to anyone "versed in the field".

    "skilled in the art" (which has a nice ring to it) is the technical term
    in English (see
    https://www.epo.org/en/legal/guidelines-epc/2024/g_vii_3.html ).

    But presence or lack of an inventive step can be quite difficult to judge.
    Examiners have argued that "This solution is so simple, somebody
    must have come across it before"...

    In one particular case, a colleague and myself were cooperating
    closely on a related group of inventions. It was quite amusing
    that one particular point was quite obvious to him, which I found
    out when I told him about my "new" finding. Even more amusing
    was that the same thing happened vice versa - something that was
    completely obvious to me wasn't obvious to him at all, and needed
    an explanation.

    I.e US vs Norwegian (European?) patent law.

    I think the US has now gotten closer in patent law to what the
    rest of the world is doing.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Sat Nov 16 17:28:21 2024
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    jseigh <jseigh_es00@xemaps.com> writes:

    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    It isn't a claim, just an opinion.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Sun Nov 17 09:03:06 2024
    From Newsgroup: comp.arch

    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    the same as memory_order_acquire.

    Which brings up an interesting point. Even if the hardware
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model were in
    effect. Or maybe disable reordering or optimization altogether
    for those target architectures.
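
    A small example of that point (a sketch): even if the hardware kept every
    store in order, the compiler is free to reorder the plain version, so the
    ordering still has to be stated in the source.

        #include <atomic>

        int data;
        bool flag_plain;                      // ordinary variable
        std::atomic<bool> flag_atomic{false};

        void publish_broken() {
            data = 1;
            flag_plain = true;                // compiler may move this above 'data = 1'
        }

        void publish_ok() {
            data = 1;
            flag_atomic.store(true, std::memory_order_release);  // compiler must not
        }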

    Joe Seigh

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 17 15:15:08 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alphas memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 17 15:17:52 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Nov 17 13:30:10 2024
    From Newsgroup: comp.arch

    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction.
    Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Strange! C++ has:

    https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence

    That only deals with compilers, not the arch memory order... Humm...

    Interesting Joe!
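
    A sketch of what atomic_signal_fence is for: ordering against a signal
    handler on the same thread, where only the compiler needs restraining and
    no fence instruction should be emitted (names below are just for
    illustration).

        #include <atomic>

        int detail;                              // plain data for the handler
        std::atomic<bool> posted{false};

        void before_raising_signal() {
            detail = 42;
            // Compiler-only barrier: keeps the 'detail' store above the flag
            // store, but compiles to no hardware fence.
            std::atomic_signal_fence(std::memory_order_release);
            posted.store(true, std::memory_order_relaxed);
        }

        extern "C" void on_signal(int) {         // runs on the same thread
            if (posted.load(std::memory_order_relaxed)) {
                std::atomic_signal_fence(std::memory_order_acquire);
                // 'detail' can be read safely here.
            }
        }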


    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.   Or maybe disable reordering or optimization altogether
    for those target architectures.

    Indeed. The compiler needs to know about these things. Iirc, there was
    an old post over c.p.t that deals with a compiler (think it was GCC)
    that messed up a pthread try lock for a mutex. It's a very old post. But
    I remember it for sure.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Nov 17 15:34:08 2024
    From Newsgroup: comp.arch

    On 11/17/2024 7:15 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say, solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alphas memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    I am trying to say you might not be hired if you only knew how to handle std::memory_order_seq_cst wrt C++... ?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 18 07:11:04 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat. ...
    I am trying to say you might not be hired if you only knew how to handle std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Mon Nov 18 11:56:48 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    same as memory_order_acquire.

    It's back in C++20. I think the problem wasn't so much implementing
    it, which as you say can be trivially done by aliasing with acquire,
    but specifying it. We use load dependency ordering in Java on AArch64
    to satisfy some memory model requirements, so it's not as if it's
    difficult to use.

    Which brings up an interesting point. Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    Yes, exactly. It's not as if this is an issue that affects people who
    program in high level languages, it's about what language implementers
    choose to do.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Mon Nov 18 12:03:55 2024
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?

    I don't know. Given the contortions that the Linux kernel people had
    to go through, maybe it really was present in hardware.

    As a programming language implementer, I don't much think about "Will
    the hardware really do this?" because new hardware arises all the
    time, and I don't want users' programs to stop working.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 15:20:57 2024
    From Newsgroup: comp.arch

    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.
    ...
    I am trying to say you might not be hired if you only knew how to handle
    std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    Well, if you used std::memory_order_seq_cst to implement, say, a mutex
    and/or spinlock's memory barrier logic, that would raise a red flag
    in my mind... Not good.
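
    What one would normally expect instead (a sketch): acquire on the lock
    side, release on the unlock side, and no seq_cst anywhere.

        #include <atomic>

        struct Spinlock {
            std::atomic<bool> locked{false};

            void lock() {
                // Acquire: accesses in the critical section cannot move
                // above this successful exchange.
                while (locked.exchange(true, std::memory_order_acquire)) {
                    // spin; a real one would pause / back off here
                }
            }

            void unlock() {
                // Release: stores in the critical section become visible
                // before the lock is seen as free.
                locked.store(false, std::memory_order_release);
            }
        };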
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 15:34:03 2024
    From Newsgroup: comp.arch

    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.
    ...
    I am trying to say you might not be hired if you only knew how to handle std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on
    multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above?  And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above.  Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point.  Von Neumann argued for fixed point as follows
    <https://booksite.elsevier.com/9780124077263/downloads/
    historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    well, if you used std::memory_order_seq_cst to implement, say, a mutex and/or spinlock memory barrier logic, well, that would raise a red flag
    in my mind... Not good.

    Don't tell me you want all of std::memory_order_* to default to std::memory_order_seq_cst? If you're on a system that only has seq_cst and nothing else, okay, but not on other weaker (memory order) systems, right?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 19:37:50 2024
    From Newsgroup: comp.arch

    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 19:38:41 2024
    From Newsgroup: comp.arch

    On 11/17/2024 1:30 PM, Chris M. Thomasson wrote:
    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a
    node based stack of something in RCU. In most systems it only acts
    like a compiler barrier. On the Alpha, it must emit a membar
    instruction. Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Strange! C++ has:

    https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence

    That only deals with compilers, not the arch memory order... Humm...

    Interesting Joe!


    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.   Or maybe disable reordering or optimization altogether
    for those target architectures.

    Indeed. The compiler needs to know about these things. Iirc, there was
    an old post over c.p.t that deals with a compiler (think it was GCC)
    that messed up a pthread try lock for a mutex. It's a very old post. But
    I remember it for sure.


    a song for contention:

    https://youtu.be/Sdq4T3iRV80?list=RDMMy3hf0T4qpYg

    ;^D
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 20 14:03:29 2024
    From Newsgroup: comp.arch

    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction.
    Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.



    Or maybe disable reordering or optimization altogether
    for those target architectures.
    ^^^^^^^^^^^^

    Yeah. No shit. Some people say the std::memory_order_* stuff is too complex.
    Others say it's okay. Shit happens.

    I remember way before C++11 I would feel nervous about keeping link-time
    optimizations on because I thought they might mess around with my custom
    assembly language code for my sensitive thread sync algorithms. I
    thought that a compiler would say, okay, this is calling into externally
    assembled/compiled code, no optimizations... Well, link-time
    optimization scared me.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 20 14:08:11 2024
    From Newsgroup: comp.arch

    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction.
    Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Horrible! So on the SPARC in RMO mode, if I used
    std::memory_order_consume to traverse a linked data structure in RCU, it
    would use a god damn #LoadStore | #LoadLoad for every node load? If so,
    YIKES! Might as well use a single acquire after loading the head of the list.
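
    The alternative being described, as a sketch; this relies on nodes only
    being published by release operations on the head, as in a push-only
    stack, so one acquire at the head covers the whole traversal:

        #include <atomic>

        struct Node { int value; Node *next; };  // next written before publish
        std::atomic<Node*> head{nullptr};

        int sum_with_single_acquire() {
            // One acquire (one membar on SPARC RMO) at the head...
            Node *n = head.load(std::memory_order_acquire);
            int sum = 0;
            for (; n != nullptr; n = n->next)    // ...then plain loads
                sum += n->value;
            return sum;
        }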




    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.   Or maybe disable reordering or optimization altogether
    for those target architectures.

    Joe Seigh


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Nov 21 05:46:56 2024
    From Newsgroup: comp.arch

    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Andrew.

    Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    My text for Pick Basic dependency is a quote (where I label the lines
    1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.

    That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    Kent
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Thu Nov 21 17:41:33 2024
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> wrote:
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>, <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect

    You're right, they do seem to have forgotten to define Explicit Memory
    Read effect. I'm sure they meant to.

    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    Err, the previous version of the same document. :-)

    My text for Pick Basic dependency is a quote (where I label the lines
    1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.

    That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?

    B2-174 in DDI0487J

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    So it seems. I think everything in DDI0487J was meant to be there in
    DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons. I believe the author of the
    earlier, easier-to-read version of the Memory Model left Arm for
    another company. If it's any consolation, the version of the MM before
    he rewrote it was absolutely incomprehensible.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 22 15:45:20 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid writes:
    Kent Dickey <kegs@provalid.com> wrote:
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    That sort of "summary" was exactly what I was asking for, but I don't see it,
    so can you please name the page?

    B2-174 in DDI0487J

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    So it seems. I think everything in DDI0487J was meant to be there in DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons.

    Between DDI0487G and DDI0487H, they completely rewrote the ARM
    using a requirements-based description rather than the straightforward
    prose in prior editions.

    They've been wordsmithing it in every subsequent version.

    I consider the prose version to be more readable, myself.


    --- Synchronet 3.20a-Linux NewsLink 1.114