• Arm ldaxr / stxr loop question

    From jseigh@jseigh_es00@xemaps.com to comp.arch on Mon Oct 28 15:13:03 2024
    From Newsgroup: comp.arch

    So if we were to implement a spinlock using the above instructions,
    something along the lines of

    .L0
    ldaxr    -- load lockword exclusive w/ acquire membar
    cmp      -- compare to zero
    bne  .L0 -- loop if currently locked
    stxr     -- store 1
    cbnz .L0 -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr, but there's a control dependency from the cbnz branch
    instruction, so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.
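
    For reference, the C11 shape I have in mind is roughly the following
    sketch (names are mine, acquire on the lock path only; on AArch64
    without LSE a compiler would typically lower the CAS to an ldaxr/stxr
    loop much like the one above):

    #include <stdatomic.h>

    typedef struct { atomic_int word; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        int expected = 0;
        /* acquire on success is all the lock path should need */
        while (!atomic_compare_exchange_weak_explicit(
                    &l->word, &expected, 1,
                    memory_order_acquire, memory_order_relaxed)) {
            expected = 0;   /* CAS wrote the observed value here */
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->word, 0, memory_order_release);
    }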

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 31 19:12:43 2024
    From Newsgroup: comp.arch

    On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    My guess is that so few of us understand ARM fence
    mechanics that we cannot address the question asked.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Oct 31 12:39:43 2024
    From Newsgroup: comp.arch

    On 10/28/2024 12:13 PM, jseigh wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
        ldaxr    -- load lockword exclusive w/ acquire membar
        cmp      -- compare to zero
        bne  .LO -- loop if currently locked
            stxr     -- store 1
            cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that.  We could append an additional acquire memory barrier
    but would that be necessary.

    I am not well versed with ARM. On SPARC, locking a spinlock basically goes like:

    atomic logic that locks the spinlock
    MEMBAR #LoadStore | #LoadLoad

    // critical section

    MEMBAR #LoadStore | #StoreStore
    atomic logic that unlocks the spinlock


    Now, this is different from some spinlock logic, e.g. Peterson's
    algorithm, which requires a #StoreLoad in the atomic logic that
    actually locks the spinlock. Basically, it's the same requirement the
    original SMR has: a store followed by a load to a different location
    must hold. RMO aside, even TSO cannot handle that without a membar...
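
    To make the store-then-load point concrete, here is the bare pattern
    in C11; just a sketch of the litmus test, not any particular
    implementation. With both fences present, at least one of the two
    calls must return 1:

    #include <stdatomic.h>

    static atomic_int flag0, flag1;

    /* Thread 0: announce myself, then check the other thread. */
    int thread0_try(void)
    {
        atomic_store_explicit(&flag0, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the #StoreLoad */
        return atomic_load_explicit(&flag1, memory_order_relaxed);
    }

    /* Thread 1: symmetric. Without the fences, both loads could see 0. */
    int thread1_try(void)
    {
        atomic_store_explicit(&flag1, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the #StoreLoad */
        return atomic_load_explicit(&flag0, memory_order_relaxed);
    }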




    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are.  Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 31 20:35:58 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    My guess is that so few of us understand ARM fence
    mechanics that we cannot address the question asked.

    Load-Acquire Exclusive Register derives an address from a base
    register value, loads a 32-bit word or 64-bit doubleword from memory,
    and writes it to a register. The memory access is atomic. The PE marks
    the physical address being accessed as an exclusive access. This exclusive
    access mark is checked by Store Exclusive instructions. See Synchronization
    and semaphores. The instruction also has memory ordering semantics as
    described in Load-Acquire, Load-AcquirePC, and Store-Release. For
    information about memory accesses, see Load/store addressing modes.


    Arm provides a set of instructions with Acquire semantics for loads,
    and Release semantics for stores. These instructions support the
    Release Consistency sequentially consistent (RCsc) model. In addition,
    FEAT_LRCPC provides Load-AcquirePC instructions. The combination of
    Load-AcquirePC and Store-Release can be used to support the weaker
    Release Consistency processor consistent (RCpc) model.

    The full definitions of the Load-Acquire and Load-AcquirePC instructions
    are covered formally in the Definition of the Arm memory model.

    https://developer.arm.com/documentation/102105/latest/
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Fri Nov 1 16:17:49 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 2 12:10:30 2024
    From Newsgroup: comp.arch

    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    Ahhhh! I just learned something about ARM right here. I am so used to
    the acquire membar being placed _after_ the atomic logic that locks the spinlock.

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    So this acts just like a SPARC style:

    atomically_lock_spinlock();
    membar #LoadStore | #LoadLoad

    right?



    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Fri Nov 8 03:17:51 2024
    From Newsgroup: comp.arch

    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 8 14:19:16 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    DDI0487K_a is the most recent.


    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not

    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"
    "2:"
    : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
    :
    : "cc", "memory");

    return result;
    }

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 8 13:40:17 2024
    From Newsgroup: comp.arch

    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
            ldaxr    -- load lockword exclusive w/ acquire membar
            cmp      -- compare to zero
            bne  .LO -- loop if currently locked
             stxr     -- store 1
             cbnz .LO -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that.  We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    Ahhhh! I just learned something about ARM right here. I am so used to
    the acquire membar being placed _after_ the atomic logic that locks the spinlock.

    .L0
             ldaxr    -- load lockword exclusive w/ acquire membar
             cmp      -- compare to zero
             bne  .LO -- loop if currently locked
              stxr     -- store 1
              cbnz .LO -- retry if stxr failed

    So this acts just like a SPARC style:

    atomically_lock_spinlock();
    membar #LoadStore | #LoadLoad

    right?

    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guessing that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o





    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are.  Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 8 22:45:51 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM? See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 8 14:56:51 2024
    From Newsgroup: comp.arch

    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM? See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying mostly blind here. I don't really have any
    experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the
    membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 8 15:03:39 2024
    From Newsgroup: comp.arch

    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM?   See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying a mostly blind here. I don't really have any experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 8 23:36:24 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    The two concepts are orthogonal in my experience.

    ARM saw the deficiencies of LL/SC very early in the
    V8 architectural definition, and added a set of
    atomic instructions for scalability to large processor
    counts - one advantage is that the atomic operations
    can be delegated to a cache level or memory, thus potentially
    a very minor power savings in cases where contention is
    common (although such LL/SC try loops often include the ARM
    equivalent of the x86 PAUSE or MWAIT instructions to
    allow power savings during the spin).

    Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
    and very clearly defines the memory model.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Fri Nov 8 19:34:55 2024
    From Newsgroup: comp.arch

    On 11/8/24 17:56, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM?   See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying a mostly blind here. I don't really have any experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of pessimistic CAS RMW type of logic?


    In this case the stxr doesn't need a memory barrier.
    Loads can move forward of it but not forward of the ldaxr
    because it has acquire semantics. For a lock that's ok,
    since the stxr would fail if any other thread acquired
    the lock; the conditional branch would keep the loads
    speculative if the stxr failed, I believe.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Nov 8 21:00:53 2024
    From Newsgroup: comp.arch

    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP (Load Exclusive Pair of registers) and
    STXP (Store Exclusive Pair of registers), which look like they can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
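
    In C terms that pairing would be reached through a 16-byte
    compare-and-swap, roughly like the sketch below. Whether the compiler
    emits an LDXP/STXP loop, a CASP, or a libatomic call depends on the
    target options (e.g. -march=armv8.1-a+lse), so treat it as an
    illustration rather than a guaranteed instruction sequence:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } pair_t;

    /* Double-wide CAS: swap in 'desired' only if *target still equals
     * *expected; on failure, *expected is updated with the current value. */
    bool dwcas(_Atomic pair_t *target, pair_t *expected, pair_t desired)
    {
        return atomic_compare_exchange_strong(target, expected, desired);
    }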


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Nov 9 14:23:47 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 9 13:08:31 2024
    From Newsgroup: comp.arch

    On 11/9/2024 6:23 AM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Ohhhhhh... That's nice! So, we can use both flavors wrt optimistic and pessimistic? Fwiw, I still did not get a chance to read up on the docs
    you so kindly linked me to.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 9 13:47:43 2024
    From Newsgroup: comp.arch

    On 11/8/2024 4:34 PM, jseigh wrote:
    On 11/8/24 17:56, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guess that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM?   See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying a mostly blind here. I don't really have any
    experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that
    the membar should be after the final store that actually locks the
    spinlock wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?


    In this case the the stxr doesn't need a memory barrier.

    So, once that stxr completes it already has acquire membar semantics
    from its prior load wrt its acquire? Never mind. I am busy right now on
    some other things.


    Loads can move forward of it but not forward of the ldaxr
    because it has acquire semantics.  For a lock that's ok
    since the stxr would fail if any other thread acquired
    the lock the conditional branch would make the loads
    speculative if the stxr failed I believe.

    Joe Seigh

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 9 14:26:16 2024
    From Newsgroup: comp.arch

    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .LO -- loop if currently locked
    stxr -- store 1
    cbnz .LO -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    DDI0487K_a is the most recent.


    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not

    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...



    "2:"
    : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
    :
    : "cc", "memory");

    return result;
    }


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Nov 9 23:18:14 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:


    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...

    Data Memory Barrier - inner shareable coherency domain
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 01:26:22 2024
    From Newsgroup: comp.arch

    On Sat, 9 Nov 2024 23:18:14 +0000, Scott Lurndal wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:


    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...

    Data Memory Barrior - inner sharable coherency domain

    It reads better without explanation ...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Nov 10 01:37:26 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC?).

    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    Anybody?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 02:44:39 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 1:37:26 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a
    sense of humour?

    We got a control register in the Mc 88100 called FPECR --
    Floating Point Exception Control Register -- past our
    censors (management). We were rather happy about it, too.

    Ed Rupp, who wrote the 68020/30 µcode assembler: due to the
    way we implemented the µROM, we could interchange rows and
    columns to optimize various stuff. We (the engineers)
    got together one night and rearranged the rows and
    columns such that if you looked at the µROM from a good
    distance back, you would see "Moto Man Lives" in bits
    across the ROM. ...
    Actually got in trouble for that one ...

    Ever?

    Anybody?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Nov 10 16:00:23 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what the advantage is for them in having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Nov 10 23:08:21 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not built with the notion of memory order,
    and not all ATOMIC events start or end with a recognizable
    instruction. Having ATOMICs announce their beginning and ending
    eliminates the need for fencing, even if you keep a <relatively>
    relaxed memory order model.

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    Blame Leslie Lamport for those requirements.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 11 12:41:08 2024
    From Newsgroup: comp.arch

    On Sun, 10 Nov 2024 16:00:23 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead
    of pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.



    The correct question is not "Why have them?", but "Why not?".
    In an ISA with fixed 32-bit instructions and 32 GPRs, opcode space
    for 2-register operations without an immediate is extremely cheap.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 13:57:55 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 13:59:22 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not build with the notion of memory order
    and that not all ATOMIC events start or end with a recognizable inst-
    ruction. Having ATOMICs announce their beginning and ending eliminates
    the need for fencing; even if you keep a <relatively> relaxed memory
    order model.

    There are fully atomic instructions; the load/store exclusives are
    generally there for backward compatibility with armv7. The full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 11 16:28:48 2024
    From Newsgroup: comp.arch

    On Mon, 11 Nov 2024 13:59:22 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC
    instead of pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not build with the notion of memory
    order and that not all ATOMIC events start or end with a
    recognizable inst- ruction. Having ATOMICs announce their beginning
    and ending eliminates the need for fencing; even if you keep a
    <relatively> relaxed memory order model.

    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    Also for compatibility with the Cortex-A53, which is still a significant
    part of the installed base.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Mon Nov 11 09:56:44 2024
    From Newsgroup: comp.arch

    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly. ARM never
    stated what the actual issue was. I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference. Like cache line size instead
    of word size.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Nov 11 11:30:46 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    LDADDB, LDADDAB, LDADDALB, LDADDLB
    Atomic add on byte in memory atomically loads an 8-bit byte from memory,
    adds the value held in a register to it, and stores the result back to
    memory. The value initially loaded from memory is returned in the
    destination register.
    - If the destination register is not WZR, LDADDAB and LDADDALB load from
    memory with acquire semantics.
    - LDADDLB and LDADDALB store to memory with release semantics.
    - LDADDB has neither acquire nor release semantics.

    And this goes on and on for all the other atomic ops, SWP, CAS, CLR, EOR,
    SET, SMIN, SMAX, UMIN, UMAX, and data sizes, half, word, dblword, pair.

    What happens if, like Apple, you want a Processor Consistency model too -
    instead of just adding one new fence instruction, do they have to add
    all the atomic instructions (ops * sizes) in again?
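
    From the C11 side the whole family collapses to one operation with a
    memory-order argument; the comments show the lowering I would expect
    from GCC/Clang with LSE enabled, but that mapping is an assumption,
    not something promised here:

    #include <stdatomic.h>

    void add_variants(atomic_int *ctr)
    {
        atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);  /* LDADD   */
        atomic_fetch_add_explicit(ctr, 1, memory_order_acquire);  /* LDADDA  */
        atomic_fetch_add_explicit(ctr, 1, memory_order_release);  /* LDADDL  */
        atomic_fetch_add_explicit(ctr, 1, memory_order_seq_cst);  /* LDADDAL */
    }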



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 17:11:10 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    Note that the atomics were added in V8.1, and were optional at that
    time.

    From the ARMv8 ARM:

    Arm provides a set of instructions with Acquire semantics for
    loads, and Release semantics for stores. These instructions
    support the Release Consistency sequentially consistent (RCsc) model.
    In addition, FEAT_LRCPC provides Load-AcquirePC instructions. The
    combination of Load-AcquirePC and Store-Release can be used to
    support the weaker Release Consistency processor consistent (RCpc) model.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 11 17:17:56 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly. ARM never
    stated what the actual issue was. I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference. Like cache line size instead
    of word size.

    Speculation is seldom accurate. I would suggest that it
    is more likely that there were requests from ARM customers
    who were looking to build larger SMP systems and it had been
    clear for decades that LL/SC could not scale to larger
    processor counts.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Malcolm Beattie@mbeattie@clueful.co.uk to comp.arch on Mon Nov 11 18:17:54 2024
    From Newsgroup: comp.arch

    On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a sense of humour? Ever?

    One of the resource types in JES2, the batch subsystem for z/OS, is
    BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
    the sysprog. Not too noticeable as humourous but for low-level use
    from Assembler some of the macros which manipulate them allow you to
    (1) copy one into memory, i.e. "Deliver Or Get" a BERT
    (2) define a hook to get control when a BERT is released, i.e
    "Do It Later" for a BERT release.
    (3) generate a control block for a related data area, i.e. a
    "Collector Attribute Table" for BERTs.

    These macros are
    (1) $DOGBERT
    (2) $DILBERT
    (3) $CATBERT

    --Malcolm
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 11 13:53:43 2024
    From Newsgroup: comp.arch

    On 11/11/2024 6:56 AM, jseigh wrote:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions,  the load/store exclusives are
    generally there for backward compatability with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly.  ARM never
    stated what the actual issue was.  I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference.  Like cache line size instead
    of word size.

    For some reason it reminds me of the size of a reservation granule wrt
    LL/SC.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 11 14:02:22 2024
    From Newsgroup: comp.arch

    On 11/11/2024 9:11 AM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. Is seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    Note that the atomics were added in V8.1, and were optional at that
    time.

    From the ARMv8 ARM:

    Arm provides a set of instructions with Acquire semantics for
    loads, and Release semantics for stores. These instructions
    support the Release Consistency sequentially consistent (RCsc) model.
    In addition, FEAT_LRCPC provides Load-AcquirePC instructions. The
    combination of Load-AcquirePC and Store-Release can be use to
    support the weaker Release Consistency processor consistent (RCpc) model.

    It sure seems like the "weaker" release is similar to unlocking a
    spinlock with a plain MOV store on x86, because a store there already has implied release membar semantics, aka (#LoadStore | #StoreStore).
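    (As a minimal C11 sketch of that mapping, not from the original post and with illustrative names: a release store paired with an acquire load is the usual flag-passing idiom. Compilers typically lower the acquire load to LDAR, or LDAPR where FEAT_LRCPC is available, on AArch64, and to a plain MOV on x86.)

    #include <stdatomic.h>

    int payload;                    /* ordinary, non-atomic data */
    atomic_int ready;               /* synchronization flag, initially 0 */

    void producer(void)
    {
        payload = 42;
        /* release store: STLR on AArch64, plain MOV on x86 */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* acquire load: LDAR (or LDAPR with FEAT_LRCPC) on AArch64,
           plain MOV on x86 */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return payload;             /* guaranteed to observe 42 */
    }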
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Tue Nov 12 12:14:47 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like
    fire-and-forget counters, for example.

    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.
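    (To illustrate, a minimal C11 sketch, not from the original post: in the classic store-buffering pattern each thread does a seq_cst store and then a seq_cst load of the other flag. On AArch64 this typically compiles to STLR followed by LDAR, and it is the LDAR that is not allowed to complete ahead of the earlier STLR, which is where the StoreLoad effect shows up.)

    #include <stdatomic.h>

    atomic_int x, y;    /* both initially 0 */

    int thread0(void)
    {
        atomic_store_explicit(&x, 1, memory_order_seq_cst);    /* STLR x */
        return atomic_load_explicit(&y, memory_order_seq_cst); /* LDAR y */
    }

    int thread1(void)
    {
        atomic_store_explicit(&y, 1, memory_order_seq_cst);    /* STLR y */
        return atomic_load_explicit(&x, memory_order_seq_cst); /* LDAR x */
    }

    /* Sequential consistency forbids both calls returning 0. */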

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 12 13:55:25 2024
    From Newsgroup: comp.arch

    Malcolm Beattie <mbeattie@clueful.co.uk> schrieb:
    On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only
    PowerPC?).

    Can anybody find any other example of any IBM engineer ever having a sense of humour? Ever?

    One of the resource types in JES2, the batch subsystem for z/OS, is
    BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
    the sysprog. Not too noticeable as humorous but for low-level use
    from Assembler some of the macros which manipulate them allow you to
    (1) copy one into memory, i.e. "Deliver Or Get" a BERT
    (2) define a hook to get control when a BERT is released, i.e.
    "Do It Later" for a BERT release.
    (3) generate a control block for a related data area, i.e. a
    "Collector Attribute Table" for BERTs.

    These macros are
    (1) $DOGBERT
    (2) $DILBERT
    (3) $CATBERT

    Do you know if these macros existed before 1993, when Dilbert was
    first released?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 14:10:14 2024
    From Newsgroup: comp.arch

    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like fire-and-forget counters, for example.

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs a StoreLoad barrier when an algorithm depends on a store followed by a load from another location holding in that order. LoadStore is not strong enough. The SMR algorithm needs that; iirc, Peterson's algorithm needs
    it as well.
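    (For example, a minimal C11 sketch of Peterson's entry protocol, not from the original post: the store to one's own flag has to be ordered before the load of the peer's flag, which is exactly a StoreLoad requirement. With seq_cst accesses a compiler typically emits an MFENCE or a locked RMW for the flag store on x86.)

    #include <stdatomic.h>

    atomic_int flag[2];     /* both initially 0 */
    atomic_int turn;

    void peterson_lock(int self)
    {
        int other = 1 - self;
        atomic_store_explicit(&flag[self], 1, memory_order_seq_cst);
        atomic_store_explicit(&turn, other, memory_order_seq_cst);
        /* StoreLoad: the loads below must not be satisfied before the
           stores above are globally visible. */
        while (atomic_load_explicit(&flag[other], memory_order_seq_cst) &&
               atomic_load_explicit(&turn, memory_order_seq_cst) == other)
            ;   /* spin */
    }

    void peterson_unlock(int self)
    {
        atomic_store_explicit(&flag[self], 0, memory_order_release);
    }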
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 14:25:48 2024
    From Newsgroup: comp.arch

    On 11/11/2024 1:53 PM, Chris M. Thomasson wrote:
    On 11/11/2024 6:56 AM, jseigh wrote:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions,  the load/store exclusives are
    generally there for backward compatibility with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly.  ARM never
    stated what the actual issue was.  I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference.  Like cache line size instead
    of word size.

    For some reason it reminds me of the size of a reservation granule wrt LL/SC.

    For some reason I remember, way back, having to pad and align things
    to reservation-granule boundaries on PPC. Iirc, it was the "anchor"
    structure. The nodes were aligned and padded up to L2 cache lines. This
    was 20+ years ago! damn it. Time goes by. Uggg. ;^o
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 12 16:55:40 2024
    From Newsgroup: comp.arch

    On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like fire-and-forget counters, for example.


    I ended up mostly with a simpler model, IMO:
      Normal / RAM-like: Fetch cache line, write back when evicting;
        Operations: LoadTile, StoreTile, SwapTile,
          LoadPrefetch, StorePrefetch
      Volatile (RAM like): Fetch, operate, write-back;
      MMIO: Remote Load/Store/Swap request;
        Operation is performed on target;
        Currently only supports DWORD and QWORD access;
        Operations are strictly sequential.

    In theory, MMIO access could be added to RAM, but unclear if worth the
    added cost and complexity of doing so. Could more easily enforce strict consistency.

    The LoadPrefetch and StorePrefetch operations:
      LoadPrefetch, try to perform a load from RAM
        Always responds immediately
        Signals whether it was an L2 hit or L2 Miss.
      StorePrefetch
        Basically like LoadPrefetch
        Signals that the intention is to write to memory.


    In my cache and bus design, I sometimes refer to cache lines as "tiles"
    partly because of how I viewed them as operating, which didn't exactly
    match the online descriptions of cache lines.

    Say:
      Tile:
        16 bytes in the current implementation.
        Accessed in even and odd rows
          A memory access may span an even tile and an odd tile;
          The L1 caches need to have a matched pair of tiles for an access.
      Cache Line:
        Usually described as always 32 bytes;
        Descriptions seemed to assume only a single row of lines in caches.
          Generally no mention of allowing for an even/odd scheme.

    Seemingly, a cache that operated with cache lines would use a single row
    of 32-byte cache lines, with misaligned accesses presumably spanning a
    pair of adjacent cache lines. To fit with BRAM access patterns, would
    likely need to split lines in half, and then mirror the relevant tag
    bits (to allow detecting hit/miss).

    However, online descriptions generally made no mention of how misaligned accesses were intended to be handled within the limits of a dual-ported
    RAM (1R1W).


    My L2 cache operates in a way more like that of traditional descriptions
    of cache lines, except that they are currently 64 bytes in my L2 cache
    (and internally subdivided into four 16-byte parts).

    The use of 64 bytes was mostly because this size got the most bandwidth
    with my DDR interface (with 16 or 32 byte transfers, more cycles are
    spent overhead; however latency was lower).

    In this case, the L2<->RAM interface:
      512 bit Load Data
      512 bit Store Data
      Load Address
      Store Address
      Request Code (IDLE/LOAD/STORE/SWAP)
      Request Sequence Number
      Response Code (READY/OK/HOLD/FAIL)
      Response Sequence Number

    Originally, there were no sequence numbers, and IDLE/READY signaling was
    used between each request (needed to return to this state before
    starting a new request). The sequence numbers avoided needing to return
    to an IDLE/READY state, allowing the bandwidth over this interface to be nearly doubled.

    In a SWAP request, the Load and Store are performed end to end.
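
    (Purely as a software-side model of that interface, a hedged C sketch; the real thing is a set of hardware signals, and any field widths beyond those stated above are guesses.)

    #include <stdint.h>

    enum req_code  { REQ_IDLE, REQ_LOAD, REQ_STORE, REQ_SWAP };
    enum resp_code { RESP_READY, RESP_OK, RESP_HOLD, RESP_FAIL };

    /* One L2<->RAM transaction, 64-byte (512-bit) data each way. */
    struct l2_ram_xact {
        uint8_t        load_data[64];   /* 512-bit load data (response side) */
        uint8_t        store_data[64];  /* 512-bit store data (request side) */
        uint64_t       load_addr;
        uint64_t       store_addr;
        enum req_code  req;             /* IDLE / LOAD / STORE / SWAP        */
        uint16_t       req_seq;         /* request sequence number           */
        enum resp_code resp;            /* READY / OK / HOLD / FAIL          */
        uint16_t       resp_seq;        /* response sequence number          */
    };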

    General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled, effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each direction for SWAP), which is fairly close to the theoretical limit (internally, the logic for the DDR controller runs at 100MHz, driving IO
    as 100MHz SDR, albeit using both posedge and negedge for sampling
    responses from the DDR chip, so ~ 200 MHz if seen as SDR).

    Theoretically, would be faster to access the chip using the SERDES
    interface, but:
    Hadn't gone up the learning curve for this;
    Unclear if I could really effectively utilize the bandwidth with a 50MHz
    CPU and my current bus;
    Actual bandwidth gains would be smaller, as then CAS and RAS latency
    would dominate.

    Could in theory have used Vivado MIG, but then I would have needed to
    deal with AXI, and never crossed the threshold of wanting to deal with AXI.


    Between CPU, L2, and various other devices, I am using a ringbus:
      Connections:
        128 bits data;
        48 bits address (96 bits between L1 caches and TLB);
        16 bits: request/response code and flags;
        16 bits: source/dest node and request sequence number;
      Each node has a set of input and output connections;
        Each node may modify a request/response,
          or simply forward from input to output.
        Messages move along at one position per clock cycle.
          Generally also 50 MHz at present (*1).

    *1: Pretty much everything (apart from some hardware interfaces) runs on
    the same clock. Some devices needed faster clocks. Any slower clocks
    were generally faked using accumulator dividers (add a fraction every clock-cycle and use the MSB of the accumulator as the virtual clock).
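
    (A minimal C model of that accumulator-divider trick, not from the original post; the divide ratio below is just an example.)

    #include <stdint.h>

    static uint32_t clk_acc;        /* phase accumulator */

    /* Called once per 50 MHz master cycle; returns the virtual clock level.
       step = ratio * 2^32, e.g. 0x40000000 (1/4) gives a ~12.5 MHz clock. */
    static int virtual_clock(uint32_t step)
    {
        clk_acc += step;
        return clk_acc >> 31;       /* MSB of the accumulator */
    }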


    Comparably, the per-node logic cost isn't too high, nor is the logic complexity. However, performance of the ring is very sensitive to ring
    latency (and there are some amount of hacks to try to reduce the overall latency of the ring in common paths).


    At present, the highest resolution video modes that can be managed semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec.

    Can do 800x600 or similar in RGBI or color-cell modes (640x400 or
    640x480 CC also being an option). Theoretically, there is a 1024x768 monochrome mode, but this is mostly untested. The 4-color and monochrome
    modes had optional Bayer-pattern sub-modes to mimic full color.

    Main modes I have ended up using:
      80x25 and 80x50 text/color-cell modes;
        Text and color cell graphics exist in the same mode.
      320x200 hi-color (RGB555);
      640x400 indexed 256 color.


    Trying to go much higher than this, and the combination of ringbus
    latency and L2 misses turns the display into a broken mess (with a DRAM
    backed framebuffer). Originally, I had the framebuffer in Block-RAM, but
    this in turn set the hard-limit based on framebuffer size (and putting framebuffer in DRAM allowing for a bigger L2 cache).

    Theoretically, could allow higher resolution modes by adding a fast path between the display output and DDR RAM interface (with access then being multiplexed with the L2 cache). Have not done so.

    Or, possible but more radical:
      Bolt the VGA output module directly to the L2 cache;
      Could theoretically do 800x600 high-color
        Would eat around 2/3 of total RAM bandwidth.

    Major concern here is that setting resolutions too high would starve the
    CPU of the ability to access memory (vs the current situation where
    trying to set higher resolutions mostly results in progressively worse
    display glitches).

    Logic would need to be in place so that display can't totally hog the
    RAM interface. If doing so, may also make sense to move from color-cell
    and block-organized memory to fully raster oriented frame-buffers.

    Though, despite being more wonky / non-standard, the block-oriented framebuffer layout has tended to play along better with memory fetch. A
    raster oriented framebuffer is much more sensitive to timing and access-latency issues compared with 4x4 or 8x8 pixel blocks, with the
    display working on an internal cache of around 2 .. 4 rows of blocks.

    Raster generally needs results to be streamed in-order and at a
    consistent latency, whereas blocks can use hit/miss handling, with a
    hit/miss probe running ahead of the current raster position (and
    hopefully able to get the block fetched before it is time to display
    it). Though, did add logic to the display to avoid sending new prefetch requests for a block if it is still waiting for a response on that block (mostly as otherwise the VRAM cache was spamming the ringbus with
    excessive prefetch requests). Where, in this case, the VRAM was using exclusively prefetch requests during screen refresh.

    In practice, it may not matter that much if the hardware framebuffer is block-ordered rather than raster. The OS's display driver is the only
    thing that really needs to care. Main case where it could arguably
    "actually matter" being full-screen programs using a DirectDraw style interface, but likely doesn't matter that much if the program is given
    an off-screen framebuffer to draw into rather than the actual hardware framebuffer (with the contents being copied over during a "buffer swap" event).

    But, as noted, I was mostly using a partly GDI+VfW inspired interface,
    which seems "mostly OK". Difference in overhead isn't that large; and
    "Draw this here bitmap onto this HDC" offers a certain level of hardware abstraction; as there is no implicit assumption that the pixel format in
    one's bitmap object needs to match the format and layout of the display device.


    Nevermind if for GUI like operation, programs/windows were mostly
    operating in hi-color, with stuff being awkwardly converted to 256 color during window-stack redraw. Granted, pretty sure old-style Windows
    didn't work this way, and per-window framebuffers eat a lot of RAM (note
    that the shell had tabs, but all the tabs share a single window
    framebuffer; rather each has a separate character cell buffer, and the
    cells are redrawn to the window buffer either when switching tabs or
    when more text is printed).

    Had considered option for 256-color or 16 color window buffers (to save
    RAM), but haven't done so yet (for now, if drawing a 16 or 256 color
    bitmap, it is internally converted to hi-color). More likely, would
    switch to 256 color window buffers if using a 256 color output mode (so conversion to 256 color would be handled when drawing the bitmap to the window, rather than later in the process).


    Well, I guess sort of similar wonk that the internal audio mixing is
    using 16-bit PCM, whereas the output is A-Law (for the hardware loop
    buffer). But, a case could be made for doing the OS level audio mixing
    as Binary16.


    Either way, longer term future of my project is uncertain...

    And, unclear if a success or failure.

    It mostly did about as well as I could expect.

    Never did achieve my original goal of "fast enough to run Quake at
    decent framerates", but partly because younger self didn't realize
    everything would be stuck at 50 MHz (or that 75 or 100 MHz core would
    end up needing to be comparably anemic scalar RISC cores; which still
    can't really get good framerates in Quake, *).

    *: A 100 MHz RV64G core still will not make Quake fast. Extra not helped
    if one needs to use smaller L1 caches and generate stalls on memory RAW hazards, ...


    Sadly, closest I have gotten thus far involves GLQuake and a hardware rasterizer module. And then the problem that performance is more limited
    by trying to get geometry processed and fed into the module, than its
    ability to walk edges and rasterize stuff. Would seemingly ideally need something that could also perform transforms and deal with
    perspective-correct texture filtering (vs CPU side transforms, and
    dynamic subdivision + affine texture filtering).

    ...


    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Tue Nov 12 23:02:13 2024
    From Newsgroup: comp.arch

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Tue Nov 12 18:55:42 2024
    From Newsgroup: comp.arch

    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Andrew.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Nov 13 00:29:42 2024
    From Newsgroup: comp.arch

    On Tue, 12 Nov 2024 23:02:13 +0000, aph@littlepinkcloud.invalid wrote:

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    1990-1992: I was working on Mc88120. It had a conditional cache--a
    place to store store-data until the store instruction became consistent.
    After becoming consistent, the store data would migrate to L1 or on
    to DRAM, ... This structure could be probed for memory order rather
    similar to what ARM is doing.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 16:32:00 2024
    From Newsgroup: comp.arch

    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad.

    Ahhh. So, well, it makes me think of the implied StoreLoad in x86/x64
    LOCK'ed RMW's...? Does this make any sense to you? Or, am I wandering
    around in a damn field somewhere! ;^o

    I am so used to SPARC style in RMO mode. The LoadStore should be _after_
    any "naked", but atomic logic that acquires and releases a spinlock...
    ;^o acquire with regard to the memory barrier logic, -not- the atomic
    logic that gains the lock itself....


    LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 17:43:33 2024
    From Newsgroup: comp.arch

    On 11/12/2024 4:32 PM, Chris M. Thomasson wrote:
    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad.

    Ahhh. So, well, it makes me think of the implied StoreLoad in x86/x64 LOCK'ed RMW's...? Does this make any sense to you? Or, am I wandering
    around in a damn field somewhere! ;^o

    I am so used to SPARC style in RMO mode. The LoadStore should be _after_
    any "naked", but atomic logic that acquires and releases a
    spinlock...

    I need to clarify here. Shit. The acquire barrier (#LoadStore | #LoadLoad) should come after the atomic logic that acquires the spinlock. The release barrier (#LoadStore | #StoreStore) should come before the atomic logic that
    releases the spinlock. Humm... How much more complicated can I make it?
    Sorry.


    __________
    naked_atomic_lock()
    membar #LoadStore | #LoadLoad

    // locked region

    membar #LoadStore | #StoreStore
    naked_atomic_unlock();
    __________
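
    (In C11 fence terms, a rough equivalent of the above, assuming the "naked" lock/unlock are bare relaxed atomic operations; names here are illustrative.)

    #include <stdatomic.h>

    static atomic_int lockword;     /* 0 = free, 1 = held */

    static void lock(void)
    {
        /* bare atomic acquire of the lock word, no ordering by itself */
        while (atomic_exchange_explicit(&lockword, 1, memory_order_relaxed))
            ;
        atomic_thread_fence(memory_order_acquire);  /* #LoadStore | #LoadLoad   */
    }

    static void unlock(void)
    {
        atomic_thread_fence(memory_order_release);  /* #LoadStore | #StoreStore */
        /* bare store that releases the lock word */
        atomic_store_explicit(&lockword, 0, memory_order_relaxed);
    }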

    Damn it. Sorry everybody!

    [...]
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 17:50:11 2024
    From Newsgroup: comp.arch

    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Iirc, a sequential membar was strange on SPARC. I have seen things like
    this before wrt RMO mode:

    membar #StoreLoad | #LoadStore | #LoadLoad | #StoreStore

    shit. It's been a while! damn.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Nov 12 17:53:58 2024
    From Newsgroup: comp.arch

    On 11/12/2024 2:55 PM, BGB wrote:
    On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like
    fire-and-forget counters, for example.


    [...]
    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.


    Humm... It makes me think of, well... does an atomic RMW have implied
    membars, or are they completely separate, akin to the SPARC membar instruction? LOCK'ed RMW on Intel (XCHG aside, wrt its
    implied LOCK prefix), well, they are StoreLoad! Shit.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 13 07:37:46 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    To me, brilliant is something that still isn't obvious after learning
    about it.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 13 01:23:51 2024
    From Newsgroup: comp.arch

    On 11/12/2024 5:50 PM, Chris M. Thomasson wrote:
    On 11/12/2024 3:02 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithms needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad.

    Very interesting... Need to ponder on this. Still, running it with no
    memory barrier at all via an RCU-based algorithm has to be faster. Humm...
    Membar-free reads are very nice.



    LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Iirc, a sequential membar was strange on SPARC. I have seen things like
    this before wrt RMO mode:

    membar #StoreLoad | #LoadStore | #LoadLoad | #StoreStore

    shit. It's been a while! damn.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 13 04:25:15 2024
    From Newsgroup: comp.arch



    The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.


    Humm... It makes me think of, well... does an atomic RMW have implied membars, or are they completely separated akin to the SPARC membar instruction? LOCK'ed RMW on Intel, XCHG instruction aside wrt its
    implied LOCK prefix, well, they are StoreLoad! Shit.

    I am not sure how atomic memory ops are implemented through AMBA / AXI.
    I think AMBA / AXI is a very good bus to use. It turns out I have been
    using a similar proprietary bus (FTA) for my project. I have been
    working on an AXI bus bridge so that I can migrate my system to AXI.
    In FTA bus there is a command field on the bus that allows atomic memory
    ops to be specified. There does not seem to be an equivalent in AXI.
    Unless perhaps the user tag field is used.
    One thing about the AXI bus is I do not understand how the CAS
    instruction is supported. In my bus CAS is supported with double data on
    the bus. There are two data items that need to be supplied to the memory controller for CAS.
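
    (For reference, a hedged C sketch of what the controller has to do with those two data items; names are illustrative, and the real operation is performed atomically at the memory controller.)

    #include <stdbool.h>
    #include <stdint.h>

    /* A CAS command carries a compare value and a swap value. */
    bool cas_at_controller(uint64_t *addr, uint64_t compare, uint64_t swap)
    {
        if (*addr != compare)
            return false;     /* mismatch: memory left unchanged */
        *addr = swap;         /* match: write the swap value     */
        return true;
    }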
    Another issue I ran into was that FTA uses the response bus to send MSI
    interrupts. I am thinking of using the AXI read response bus for this
    purpose, by sending an ERR response for interrupts with the read data containing the interrupt info. But I do not know if AXI devices will get confused seeing a read response without any read address previously
    supplied. I am assuming devices will be able to filter bus transactions
    using transaction ids.

    I have not been able to get the Q+ CPU to operate reliably in the FPGA,
    so I am stuck without a system CPU. I have given some thought to just
    using an FPGA with built in (ARM) CPU cores. I want to get working on peripheral cores.

    I have been using the MIG controller in native mode (non-AXI) coupled
    with a multi-port memory controller for access to DDR3 RAM. The MIG
    controller can supply a lot of bandwidth, but it has some latency to it.
    I measured it at 28 clocks IIRC. I think timing depends on the memory component too. But that is at the MIG controller frequency. In my case
    200 MHz. At the CPU frequency it is much less. While there is latency, a
    new request can be made almost every memory clock. To get a lot of
    bandwidth requestors like the frame buffer request an entire scan-line
    of data with back-to-back accesses to the MIG controller. The frame
    buffer uses a burst of 50 accesses, so it takes around 80 memory clock
    cycles. In terms of a 50 MHz CPU that would be only about 20 clocks. The
    frame buffer uses about 10% of the memory bandwidth and supports
    800x600x16bpp mode. I think it could support 1920x1080x16bpp video. But
    I did not want to spend that much bandwidth on video.
    The biggest issue I found with the MIG controller was specifying the
    right memory component.
    It is 900 MHz memory but seems to run okay at 800 MHz. 800 MHz was more conveniently matched to other clocks in the system.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 13 10:20:12 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    To me brilliant is something that still isn't obvious after larning
    about it.

    Why do you think it's less brilliant to recognize something obvious
    that everybody else has overlooked?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Wed Nov 13 18:07:17 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Yes. LDAR and STLR, used together, are sequentially consistent. This
    is a stronger guarantee than acquire and release.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Wed Nov 13 18:13:04 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to read, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not brilliant since it is so very obvious in hindsight.

    LOL! :-)


    To me brilliant is something that still isn't obvious after larning
    about it.

    You have very high standards.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 13 13:40:32 2024
    From Newsgroup: comp.arch

    On 11/13/2024 4:20 AM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    To me brilliant is something that still isn't obvious after larning
    about it.

    Why do you think it's less brilliant to recognize something obvious
    that everybody else has overlooked?


    Yeah.

    This is my feelings about some of the deficiencies of standard RISC-V.

    The stuff I want added, and have added as experiments, is not exactly non-obvious. Saw at least one person doubting that it would make much difference (namely in the use of "make the immediate bigger" prefixes).


    But, experimentally, it does make enough of a difference that it should
    be worth considering, at least for performance-oriented use-cases
    (likely, not really needed for microcontrollers, where the priority is
    more "cheap CPU" rather than "fast CPU").


    But, as I see it, if you can make binaries 40% smaller, and 35% faster,
    this is something that should be worth considering.

    As opposed to the C extension which IME seems to only give around a
    25-30% size reduction, and (with a CPU design that only does superscalar
    on properly aligned 32-bit instructions) actually makes performance
    slightly worse.

    Granted, having both jumbo prefixes and the 'C' extension being likely a
    best case for code density (though, BGBCC doesn't yet support the 'C' extension, so I can't test this).

    I am half tempted to move the RV jumbo prefixes from
    ...-100-kkkkk-00-11011 (ALUIW block)
    To:
    ...-100-kkkkk-00-00111 (JALR block)

    For "technical reasons" (well, would also clean up the encoding conflict
    with an older/dropped "ADDIWU" instruction). TBD if worth the break in compatibility though (if I did so, might consider also claiming 1xx for
    jumbo prefixes, say, to give an extra bit so that "JIMM+JIMM+LUI" could
    have enough bits to encode F0..F31 as well, but there are other
    possibilities for how to encode this).



    Most of these features have historical precedent as well, so should in
    theory be "safe" (similar sorts of prefixes existed in Transputer and
    Java VM).

    Granted, not found examples thus far in 1980s or 1990s RISC
    architectures (these sorts of prefixes didn't really seem to start
    appearing in RISC's until the early 2000s). Annoyingly, most precedent
    for the use of prefixes and prefix instructions seems to be in terms of
    CISC architectures.

    The closest direct equivalent of the Jumbo_Imm prefix I am aware of
    didn't appear until MicroBlaze, which is cutting it a little close (and
    have yet to verify if it existed in the original version of MicroBlaze).
    In any case, will probably be safer in a few years (as MicroBlaze
    moves further outside of the 20 year window).


    Register-Indexed Load/Store and similar were fairly widespread (80386,
    ARM32, and others), so should be safe.


    Can note that also, in BJX2, the general ideas behind WEX encoding also
    had precedent (was in use in 1990s DSP architectures and similar), ...


    Sometimes, there is an elegance in finding things sufficiently obvious
    that it is more a question why it is not more widespread.

    Or, avoiding things that require a non-trivial leap in logic, or pose difficulty in verifying the logic chains.


    Though, arguably, in terms of precedent, something like RISC-V is
    arguably fairly safe:
    Its core ISA lacks anything that didn't already have precedent by the
    early 1980s.


    But, as I see it, pretty much anything that has precedent earlier than
    ~2004 should be safe (which, as I see it, should include things like
    jumbo prefixes, etc).

    ...


    There are, granted, potential gotchas, like the years of hassle that
    S3TC and depth-fail shadows and similar caused.

    Where, S3TC should have been invalid, as it wasn't substantially
    different from what was already in common use in the 1980s.

    Seemingly, main arguable "novel" feature it had was defining the
    interpolated colors as 1/3 + 2/3 rather than 1-bit (A or B), or 3/8 +
    5/8 (as in some earlier Apple image formats).

    There was the "S2TC" workaround (just disallow interpolation entirely); theoretically though, someone could have just used DXT1/DXT5 mostly as
    is, but then redefined the interpolation as 3/8 + 5/8 as "close enough"...


    Similarly, the depth-fail issue was also annoying. There was still
    depth-pass, but this had some annoying edge cases that required workarounds (the shadows would break if the camera was inside a shadow
    volume).

    Depth-fail shadows should also be safe now.

    ...


    Well, and people can freely use FAT32, or (in theory) NTFS. Though, the
    design of NTFS itself is a bigger impediment to using it, albeit with
    some limits (newer features may still not be safe).

    A person should also be able to do their own off-brand implementations
    of x86-64 (*) and 32-bit ARM and Thumb/Thumb2.

    *: The original form of x86-64 should be safe, would mostly need to omit
    newer forms of SSE, and AVX, to be safe.

    ...


    May not be obvious, but admittedly, I am more someone that tries to
    avoid "novelty" (often things like cost/benefit concerns and historical precedent are given more weight).


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 13 13:08:38 2024
    From Newsgroup: comp.arch

    On 11/13/2024 10:07 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Yes. LDAR and STLR, used together, are sequentially consistent. This
    is a stronger guarantee than acquire and release.

    interesting!
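
    For what it's worth, a minimal C++ sketch of the mapping being
    discussed (assuming current GCC/Clang code generation for AArch64:
    a seq_cst load becomes LDAR, a seq_cst store becomes STLR, with no
    separate DMB between them):

        #include <atomic>

        std::atomic<int> data{0};
        std::atomic<int> flag{0};

        void publisher() {
            data.store(42, std::memory_order_seq_cst);  // expected: STLR
            flag.store(1, std::memory_order_seq_cst);   // expected: STLR
        }

        int consumer() {
            while (flag.load(std::memory_order_seq_cst) == 0)  // expected: LDAR
                ;
            return data.load(std::memory_order_seq_cst);       // expected: LDAR
        }

    The point above is that the LDAR can check whether its address hits
    anything still sitting in the local store buffer; if it doesn't, it
    need not wait for earlier STLRs to drain.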
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Nov 14 06:24:32 2024
    From Newsgroup: comp.arch

    In article <YfxXO.384093$EEm7.56154@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
    and very clearly defines the memory model.

    Your definition of "clearly" differs from mine.

    Look at Pick dependencies on page B2-239 and B2-240:
    (I'm replacing complicating details with "blah blah" or "A, B, C", to
    highlight the issue I want to point out)

    ---
    Pick Basic dependency:
    There is A, B, C, or a Pick dependency between E1 and E2
    Pick Data dependency:
    There is a Pick Basic dependency from E1 to E2 and blah blah.
    Pick Address dependency:
    There is a Pick Data dependency from E1 to E3 and E2 is blah blah
    Pick Control dependency:
    This is a Pick Basic dependency from E1 to E3 and E2 is blah blah
    Pick Dependency:
    There is a Pick Basic, Pick Address, Pick Data, or Pick Control
    dependency from E1 to E2
    ---

    This is completely circular, and never defines what "pick" is.

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect
    ---

    Using the words as they are written, if any of 1a, 1b, 2a, or 2b is
    true, a Pick Basic dependency exists between E1 and E2. To give
    background, E1 and E2 are any events (effects) and not necessarily in
    program order (E2 could be before E1, and can be on another CPU, the
    event numbering system is not defined to indicate program order and when
    they want to say E1 is in program order before E2 it seems to always
    explicitly say so, and there are LOTS of other places where they create
    a third event, E3, which may be between E1 and E2). I'm using event interchangeably with effect since I think effect is a terrible term.

    So by rule 1a by itself, a Pick Basic Dependency exists between a Load
    instruction (an example of an Explicit Memory Read, I'm assuming; as
    best as I can tell, an Explicit Memory Read is not really defined) and
    every other possible event in that system happening before or after
    that load.

    So what does this mean? I literally have no idea what they are trying to
    get at here.

    If E1 and E2 are the "same effect", does that mean it's the same instruction/operation, or just the same type of operation (like two loads),
    or what? If there were an "overview" summarizing ordering in English,
    then I could interpret the looseness better.

    I want to make it clear that I don't want a formal grammar, I just think
    this is a particularly poor way to try to present this information.

    What it reads like to me is one of those bad logic puzzles, but with
    info missing: The cookie was eaten by someone wearing a red coat. Susan
    wears a hat. Who ate the cookie?

    Kent
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Thu Nov 14 09:23:23 2024
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Nov 14 10:36:11 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    To me brilliant is something that still isn't obvious after learning
    about it.

    Why do you think it's less brilliant to recognize something obvious
    that everybody else has overlooked?

    I did not convey my intended meaning here, what I meant is that there
    are levels of brilliance, even when being the first to recognize something.

    Yeah, I have absolutely no issue with ideas that are only obvious in
    hindsight; they deserve praise. My real problem is with those things
    that are new, but only because of the environment, as in the idea would
    be obvious to anyone "versed in the field".

    I.e US vs Norwegian (European?) patent law.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Nov 14 10:41:14 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to read, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    LOL! :-)


    To me brilliant is something that still isn't obvious after learning
    about it.

    You have very high standards.

    That is one of the reasons I never started a PhD track, I could never
    find an area of study that I thought would be sufficiently ground-breaking.

    The other reason is/was that my friend Andy "Crazy" Glew did try the PhD
    route for several years and hit the same stumbling block vs his
    advisors, and I know that Andy is an idea machine well beyond myself.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 14:13:54 2024
    From Newsgroup: comp.arch

    On 11/14/2024 1:41 AM, Terje Mathisen wrote:
    aph@littlepinkcloud.invalid wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to read, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    LOL!  :-)


    To me brilliant is something that still isn't obvious after learning
    about it.

    You have very high standards.

    That is one of the reasons I never started a PhD track, I could never
    find an area of study that I thought would be sufficiently ground-breaking.

    The other reason is/was that my friend Andy "Crazy" Glew did try the PhD route for several years and hit the same stumbling block vs his
    advisors, and I know that Andy is an idea machine well beyond myself.

    I had the chance to converse with him (Andy) as well. Wonderful!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 14:21:32 2024
    From Newsgroup: comp.arch

    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Thu Nov 14 23:20:02 2024
    From Newsgroup: comp.arch

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 19:25:15 2024
    From Newsgroup: comp.arch

    On 11/14/2024 3:20 PM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Ahhhh! Thank you. Btw, agreed.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Nov 14 19:50:55 2024
    From Newsgroup: comp.arch

    On 11/9/2024 5:37 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Oh yeah! LOL! Thanks.


    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    Anybody?

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 15 07:25:12 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added
    causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An
    advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 03:17:22 2024
    From Newsgroup: comp.arch

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for an RCU-based algorithm. Even SPARC in RMO mode does not need
    this. Iirc, it's akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads
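
    A minimal sketch of that reader-side pattern (names invented for
    illustration; in practice current compilers promote
    memory_order_consume to acquire):

        #include <atomic>

        struct Node { int payload; };

        std::atomic<Node*> shared_node{nullptr};

        void writer(Node* n) {
            n->payload = 42;
            shared_node.store(n, std::memory_order_release);  // publish
        }

        int reader() {
            Node* p = shared_node.load(std::memory_order_consume);
            if (!p)
                return -1;
            // The dereference is data-dependent on the pointer load. On
            // most architectures that dependency alone orders the two
            // loads; on Alpha an additional read barrier between them
            // would be needed.
            return p->payload;
        }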





    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Nov 15 15:22:07 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 07:25:12 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."


    Of course, it's not enough for SC.
    What you said holds, for example, for TSO and even for some memory
    ordering models that are weaker than TSO.
    The point of SC is that, in addition to that, it requires any two
    stores by different agents to be observed in the same order by all
    agents in the system, including those two.
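
    Either way, the usual way to make the gap between SC and anything
    weaker concrete is a litmus test. A minimal sketch in C++ of the
    classic store-buffering case:

        #include <atomic>
        #include <thread>

        std::atomic<int> x{0}, y{0};
        int r1, r2;

        void t0() { x.store(1, std::memory_order_seq_cst);
                    r1 = y.load(std::memory_order_seq_cst); }
        void t1() { y.store(1, std::memory_order_seq_cst);
                    r2 = x.load(std::memory_order_seq_cst); }

        int main() {
            std::thread a(t0), b(t1);
            a.join(); b.join();
            // With seq_cst (i.e. SC), r1 == 0 && r2 == 0 is forbidden:
            // one of the stores must come first in the single global
            // order. Demote the accesses to acquire/release or relaxed
            // and the 0/0 outcome becomes allowed, and is readily seen
            // on TSO hardware, where each thread's load can complete
            // before its own earlier store has drained from the store
            // buffer.
            return (r1 == 0 && r2 == 0) ? 1 : 0;
        }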

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Nov 15 15:24:59 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 15 14:13:27 2024
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of
    guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added
    causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An
    advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak
    consistency.

    Perhaps one might ask Dr. Kessler?

    https://acg.cis.upenn.edu/milom/cis501-Fall09/papers/Alpha21264.pdf

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Fri Nov 15 11:08:29 2024
    From Newsgroup: comp.arch

    On 11/15/2024 2:25 AM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton

    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place. Strongly
    consistent memory won't help incompetence.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 15 17:19:34 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 15 Nov 2024 07:25:12 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of
    guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."


    Of course, it's not enough for SC.
    What you said holds, for example, for TSO and even by some memory
    ordering models that a weaker than TSO.
    The points of SC is that in addition to that it requires for any two
    stores by different agents to be observed in the same order by all
    agents in the system, including those two.

    That's included in the statement I cited: stores are operations, and
    the behaviour is the same as executing all the operations in some
    sequential order. I.e., all processors observe everything they
    observe with the same results. I have this definition from <https://en.wikipedia.org/wiki/Sequential_consistency>, which cites
    the following source for it: Leslie Lamport, "How to Make a
    Multiprocessor Computer That Correctly Executes Multiprocess
    Programs", IEEE Trans. Comput. C-28,9 (Sept. 1979), 690-691.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 15 17:27:37 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 12:42:15 2024
    From Newsgroup: comp.arch

    On 11/15/2024 5:24 AM, Michael S wrote:
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    You response does not answer Anton's question.


    I guess not. Shit happens. ;^o
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 12:48:46 2024
    From Newsgroup: comp.arch

    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Well, if one can't handle the memory barriers, say wrt
    std::memory_order_* in C++, well, that is a problem wrt creating these
    "exotic" types of algorithms. Imvho, that is.


    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.
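
    A minimal sketch of the classic case (not anyone's production code):
    a naive lock-free stack where pop() can be broken by ABA even though
    every access is seq_cst, because the CAS only checks that the head
    pointer has the same value, not that the node hasn't been popped,
    freed, and pushed back in the meantime.

        #include <atomic>

        struct Node { Node* next; int value; };

        std::atomic<Node*> head{nullptr};

        void push(Node* n) {
            n->next = head.load(std::memory_order_seq_cst);
            // On failure, n->next is refreshed with the current head.
            while (!head.compare_exchange_weak(n->next, n,
                                               std::memory_order_seq_cst))
                ;
        }

        Node* pop() {
            Node* old = head.load(std::memory_order_seq_cst);
            while (old) {
                Node* next = old->next;  // (1) read successor of old head
                // Between (1) and the CAS below, another thread may pop
                // 'old', pop and free its successor, then push 'old' back.
                // The CAS still succeeds (head again equals 'old'), but
                // 'next' is stale: that is ABA. Sequential consistency
                // does not prevent it; tags/version counters, hazard
                // pointers, or deferred reclamation are needed instead.
                if (head.compare_exchange_weak(old, next,
                                               std::memory_order_seq_cst))
                    return old;
            }
            return nullptr;
        }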
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 15 14:53:00 2024
    From Newsgroup: comp.arch

    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
    The tradeoff is more about implementation cost, performance, etc.

    Weak model:
    Cheaper (and simpler) to implement;
    Performs better when there is no need to synchronize memory;
    Performs worse when there is need to synchronize memory;
    ...



    However, local to the CPU core:
    Not respecting things like RAW hazards does not seem well advised.


    Like, if we store to a location, and then immediately read back from it,
    one can expect to see the most recently written value, not the previous
    value. Or, if one stores to two adjacent memory locations, one expects
    that both stores write the data correctly.

    Granted, it is a tradeoff:
    Not bothering: Fast, Cheap, but may break expected behavior;
    Could naively use NOPs if aliasing is possible, but this is bad.
    Add an interlock check, stall the pipeline if it happens:
    Works, but can add a noticeable performance penalty;
    My attempts at 75 and 100 MHz cores had often done this;
    Sadly, memory RAW and WAW hazards are not exactly rare.
    Use internal forwarding, so written data is used directly next cycle.
    Better performance;
    But, has a fairly high cost for the FPGA (*1).



    *1: This factor (along with L1 cache sizes) weighs in heavily to why I continue to use 50MHz. Otherwise, I could use 75 MHz, but this internal forwarding logic, and L1 caches with 32K of BRAM (excluding metadata)
    and 1-cycle access, are not really viable at 75 MHz.

    For the L2 cache, which is much bigger, one can use a few extra
    pad-cycles to access the Block-RAM array. Though, 5-cycle latency for
    Load/Store operations would not be good.

    Can note that with Block-RAM, usual behavior seems to be that if one
    tries to read from one port while writing to another port on the same
    clock edge, if both are at the same location, the prior contents will be returned. This may be a general behavior in Verilog though, rather than
    a Block-RAM thing (also seems to apply to LUTRAM if accessed in the same pattern; though LUTRAM allows also reading the value via combinatorial
    logic rather than a clock-edge, which seems to always return the value
    from the most recent clock-edge).


    As I can note, a 4K or 8K L1 cache with stall on RAW or WAW, at 75 MHz,
    tends to perform worse IME, than a 32K cache running at 50 MHz with no
    RAW/WAW stall.

    Also, trying to increase MHz by increasing instruction latency in many
    cases was also not ideal for performance.


    Granted, if I were to do things the "DEC Alpha" way, I probably could
    run stuff at 75MHz, but then would likely need the compiler to insert a
    bunch of strategic NOPs so that the program doesn't break.


    For memory ordering, possibly, in my case a case could be made for an
    "order respecting DRAM cache" via the MMIO interface, say:
    F000_01000000..F000_3FFFFFFF

    Could be defined to alias with the main RAM map, but with strictly
    sequential ordering for every memory access across all cores (at the
    expense of performance).

    Where:
    0000_00000000..7FFF_FFFFFFFF: Virtual Address Space
    8000_00000000..BFFF_FFFFFFFF: Supervisor-Only Virtual Address Space
    C000_00000000..CFFF_FFFFFFFF: Physical Address Space, Default Caching
    D000_00000000..DFFF_FFFFFFFF: Physical Address Space, Volatile/NoCache
    E000_00000000..EFFF_FFFFFFFF: Reserved
    F000_00000000..FFFF_FFFFFFFF: MMIO Space

    MMIO space is currently fully independent of RAM space.

    However, at present:
    FFFF_F0000000..FFFF_FFFFFFFF: MMIO Space, as Used for MMIO devices.

    So, in theory, remerging RAM IO space into MMIO Space would be possible
    (well, except that trying to access HW MMIO address ranges via RAM-space access would likely be disallowed).


    Can note, MMU disabled:
    0000_00000000..0FFF_FFFFFFFF: Same as C000..CFFF space.
    1000_00000000..7FFF_FFFFFFFF: Invalid

    ...

    Granted, current scheme does set a limit of 16TB of RAM.
    But, biggest FPGA boards I have only have 256MB, so, ...

    And, current VA map within TestKern (from memory):
    0000_00000000..0000_00FFFFFF: NULL Space
    0000_01000000..0000_3FFFFFFF: RAM Range (Identity Mapped)
    0000_40000000..0000_BFFFFFFF: Direct Page Mapping (no swap)
    0001_00000000..3FFF_FFFFFFFF: Mapped to swapfile, Global
    4000_00000000..7FFF_FFFFFFFF: Process Local


    Note that, within the RAM-range, the RAM will wrap around. The specifics
    of the wraparound are used to detect RAM size (this would set an
    effective limit at 512MB, after which no wraparound would be detected).

    Specifics here would need to change if larger RAM sizes were supported.

    Not sure how RAM size is detected with DIMM modules. IIRC, with PCs, it
    was more a matter of probing along linearly until one finds an address
    that no longer returns valid data (say, if one hits the 1GB mark, and
    gets back 000000 or FFFFFFF or similar, assume end of RAM at 1GB).


    One does need to make sure caches (including the L2 cache) are flushed
    during all this, as the caches, doing their usual cache thing, may make
    the probe incorrectly detect more RAM than actually exists.
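
    A minimal sketch of that kind of wraparound probe (assuming the
    window is uncached or the caches were flushed first; the constants
    are made up):

        #include <cstdint>
        #include <cstddef>

        // Write a distinct marker at each power-of-two offset and watch
        // offset 0: the first offset whose write aliases back onto
        // offset 0 is the wraparound point, i.e. the installed RAM size.
        size_t detect_ram_size(volatile uint32_t* base, size_t max_bytes) {
            const uint32_t marker = 0x55AA0000u;
            base[0] = marker;
            for (size_t sz = 1u << 20; sz < max_bytes; sz <<= 1) {
                base[sz / sizeof(uint32_t)] = marker | 1u;
                if (base[0] != marker)
                    return sz;       // write at 'sz' wrapped onto offset 0
            }
            return max_bytes;        // no wraparound seen below the limit
        }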


    ...


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Nov 15 14:05:57 2024
    From Newsgroup: comp.arch

    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
      The tradeoff is more about implementation cost, performance, etc.

    Weak model:
      Cheaper (and simpler) to implement;
      Performs better when there is no need to synchronize memory;
      Performs worse when there is need to synchronize memory;
      ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right places.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 15 17:35:22 2024
    From Newsgroup: comp.arch

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it
    whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;
    Any attempt by other cores to access this line, may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache line.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak
    model. And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache
    flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache flush.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/... mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Nov 16 00:51:36 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right
    places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write
    permission. So, not all lines being written have to request owner-
    ship.

    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.

    may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.

    And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/...

    The OS wants nothing to do with this.

    mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 16 00:39:42 2024
    From Newsgroup: comp.arch

    On 11/15/2024 6:51 PM, MitchAlsup1 wrote:
    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models,
    shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
    The tradeoff is more about implementation cost, performance, etc.
    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    A TSO from a weak memory model is as it is. It should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency via using the right membars in the right
    places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it
    whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write permission. So, not all lines being written have to request owner-
    ship.


    OK.

    I think the bigger distinction, is more that a concept of write
    ownership exists in the first place...


    In my current memory model, there is no concept of write ownership.

    Ironically, this also means the RISC-V LR/SC instructions don't make
    sense in my memory model, but this hasn't been a huge loss (they just
    sort of behave as-if they worked).


    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.


    Yeah. I didn't really distinguish these cases...

    May possibly be different in a cache system where events are processed sequentially, rather than circling around in a ring bus (and processed
    in whatever way the requests happen to hit the L2 cache or similar).


    Say, request comes in for address 123 from core B:
    Write ownership held by A?
    Send request to A to Flush 123;
    Flag 123 as the flush having been requested;
    To avoid repeating the request.
    Ignore B's request for now (it then circles the bus);
    Write ownership not held?
    If the request was for write privilege:
    Mark as held by B;
    Send response to B's request.

    If A receives a flush request:
    Flush the cache line in question;
    Write modified data, or send a FLUSH_ACK response or similar.

    When L2 receives response:
    Write data back to L2 if needed;
    Mark cache line as no longer held.

    Less obvious what happens if an L2 miss happens and the line at that
    location is still held.
    Would presumably need all cores to flush any dirty lines before they
    could be safely evicted from the L2 cache (in my current design, this
    scenario is ignored).
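
    As a sanity check on the bookkeeping, a rough sketch of that
    request/flush flow (structure and names invented for illustration,
    not taken from the actual bus logic):

        // Per-line state the L2 would need to track.
        struct L2Line {
            int  owner         = -1;     // core holding write ownership, -1 if none
            bool flush_pending = false;  // flush already requested, don't repeat it
        };

        enum class Action { Grant, Defer };

        // Request from core 'req' for this line; 'wants_write' if it asked
        // for write ownership. Defer means the request keeps circling the
        // ring bus and is retried later.
        Action l2_handle_request(L2Line& ln, int req, bool wants_write) {
            if (ln.owner != -1 && ln.owner != req) {
                if (!ln.flush_pending) {
                    // send_flush_request(ln.owner);  // ask the holder to write back
                    ln.flush_pending = true;
                }
                return Action::Defer;
            }
            if (wants_write)
                ln.owner = req;          // grant write ownership
            return Action::Grant;
        }

        // Dirty data or a FLUSH_ACK arrives back at the L2.
        void l2_handle_flush_response(L2Line& ln /*, data if dirty */) {
            // write_back_to_l2_if_dirty(...);
            ln.owner         = -1;       // line no longer held
            ln.flush_pending = false;
        }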



                                                   may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.


    Yeah.

    More bookkeeping needed here...


    Possibly though, L2 may not need to track the specific core, if it can
    send out a general message:
    "Whoever holds line 123 needs to flush it."

    Message then has a special behavior in that it circles the whole bus
    without taking any shortcut paths, and is then removed once it gets back around to the L2 cache (after presumably every other node on the bus has
    seen it), and/or gets replaced by the appropriate ACK (if it hits an L1
    cache that is holding the line in question).

    Specifics likely to differ here between a message-ring bus, and other
    types of bus.

    Possibly the comparably high latency of a message ring would not be
    ideal in this case.



    One other possibility for a bus could be be a star-network, where
    message can either be point-to-point or broadcast. Say, point-to-point
    being used if both locations have a known address, and broadcast
    messages sent to every node on the bus.


    Unclear if "hubs" on this bus would either need to know which "ports" correspond to which node address ranges, or simply broadcast any
    incoming message on all ports. Broadcast with no buffering would be cheapest/simplest, but would have overhead, and a potential for
    "collision" (where two nodes send a message at the same time, but don't
    yet see the other's message).

    Likely, each hub would need a FIFO and basic message routing, but this
    would add cost (per-node cost is likely to be higher than that of
    forwarding messages along a ring).


    But, there could be merit, say, if messages could get anywhere on the
    bus within a relatively small number of clock cycles.


    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    OK.



    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak
    model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.


    In my case, RAM-like and MMIO accesses use different messaging protocols...

    Not currently any scheme in place to support consistency modeling for
    RAM like access.

    MMIO is ordered mostly as the L1 cache will not let anything more happen
    until it gets a response (so, the L1 cache forces sequential operation
    on its end). On the other end, the bridge to the MMIO bus will become
    "busy" and not respond to any more requests until the currently active
    request has been completed (so, it is a serialized "first come, first
    serve" as far as message arrival on the ringbus).

    Atomic operations on the bus could likely be formed as a special form of
    MMIO SWAP request (with a few bits somewhere used to encode which
    operator to perform). Well, unless the only supported atomic operator is
    SWAP.

    Likely it would depend on the target device for whether or not atomic operators are allowed.





           And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache
    flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.


    Possibly so...


    Though, one possibility could be to leave this part to the OS
    scheduler/syscall/...

    The OS wants nothing to do with this.


    Unclear how to best deal with it...

    Status quo:
    Lock/release using system calls;
    System calls always perform L1 flush
    ( ... if there were more than 1 core ... ).

    Faster:
    Lock/Release handled purely in userland;
    Delay or avoid cache flushes.

    Hybrid:
    Try to have a fast-path in userland ("local core only" mutexes);
    Fall back to syscalls if not fast-path (see the sketch after this list).

    Lazy hybrid:
    Lock/release continue using system calls;
    Nothing changes as far as userland cares.
    Try to delay the L1 flushes.
    Say, to save the ~20k clock-cycles this process eats.
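
    A minimal sketch of the Hybrid option, in C++; the sys_mutex_wait /
    sys_mutex_wake calls are placeholders for whatever the kernel would
    provide, and any cache-flush work is assumed to happen on the slow path:

        #include <atomic>

        // Hypothetical kernel entry points for the slow path.
        void sys_mutex_wait(std::atomic<int> *addr, int val);
        void sys_mutex_wake(std::atomic<int> *addr);

        struct HybridMutex {
            std::atomic<int> state{0};   // 0 = free, 1 = locked, 2 = contended

            void lock() {
                int expected = 0;
                // Userland fast path: uncontended case, no syscall.
                if (state.compare_exchange_strong(expected, 1,
                                                  std::memory_order_acquire))
                    return;
                // Slow path: mark contended and sleep in the kernel, which
                // is also where any L1 flushing would be done.
                while (state.exchange(2, std::memory_order_acquire) != 0)
                    sys_mutex_wait(&state, 2);
            }

            void unlock() {
                // Wake a waiter only if the slow path was ever taken.
                if (state.exchange(0, std::memory_order_release) == 2)
                    sys_mutex_wake(&state);
            }
        };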


    Lazy flushing on syscalls and scheduler events seems possible, as
    (assuming the core isn't frozen) this will happen eventually.

    It does mean a scenario can occur (where a previously assumed local-only
    mutex is in fact non-local) that could take an unreasonably long time to
    deal with (one core needing to wait until the other core does a system call
    or similar).


    Note that if a mutex lock happens, and can't be handled immediately,
    general behavior is to mark the task as waiting on a mutex and then
    switch to a different task (this is otherwise similar to how calls like "usleep()" are handled; the task can be resumed once the mutex is no longer held).



    Though, for now, TestKern is still purely single-processor.

    But, not much motivation to invest in multicore TestKern when I can
    still generally only fit a single core on the XC7A100T.

    Can at least go dual core on the XC7A200T, but hadn't really made any
    use of it (so the second core sits around mostly doing nothing in this
    case).



    Where, in the single core case, no real way to handle mutexes other than
    to reschedule the task.

    So, some of this is still kinda theoretical.

    Admittedly, it wasn't until fairly recently that TestKern got preemptive
    task scheduling. And, even then, there were still a lot of
    race-condition type bugs early on (well, partly stemming from the
    general lack of mutexes in many cases; they are pretty much entirely absent
    in the kernel because, as-is, there is no way to actually resolve a
    mutex conflict in the kernel should one occur...).

    Well, and for userland, I ended up with generally using "reschedule on syscalls" rather than "reschedule on timer IRQ", as "reschedule on
    syscalls" was slightly less prone to result in the sorts of race
    conditions that caused stuff to break (only uses timer IRQ as a fallback
    if the task has managed to hold the CPU for an unreasonable amount of time).


    But, yeah, a lot is still "in theory" for now, actual state of TestKern
    still kinda sucks on this front...


    One option is to leave this to the OS scheduler/syscall
    mechanism; the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex
    again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).
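
    A very rough sketch of that promotion idea (all names, fields, and kernel
    hooks here are hypothetical; the real thing would live partly in the
    kernel):

        #include <atomic>

        void flush_local_l1();   // hypothetical kernel L1 writeback/invalidate

        struct LazyMutex {
            std::atomic<int>  owner_core{-1};        // core that last held it
            std::atomic<bool> promote_pending{false};
            std::atomic<bool> multicore{false};      // once set, flush on lock
        };

        // Another core wants a mutex that has so far been core-local:
        // record the request via the OS and reschedule until it is handled.
        void request_promotion(LazyMutex &m) {
            m.promote_pending.store(true);
        }

        // Run when the previous owner next enters a syscall (or re-locks):
        void promote_if_requested(LazyMutex &m) {
            if (m.promote_pending.exchange(false)) {
                flush_local_l1();          // make prior stores visible
                m.multicore.store(true);   // from now on, flush at each lock
            }
        }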


    The cost of mutex locking could almost be ignored...

    Until of course people are trying to use otherwise frivolous mutex locks
    to protect things that are only ever accessed by a single thread (as has
    sort of become the style in many codebases), etc.


    Or, say, burning extra clock-cycles in the name of "malloc()" being thread-safe (even if, much of the time, the mutex hiding inside the malloc/free calls or similar isn't actually protecting anything).


    ....

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 16 07:37:44 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say, solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for sequential consistency than for a weakly-consistent memory model
    (e.g., Alpha's memory model) is incompetent.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 16 07:46:17 2024
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    The tradeoff is more about implementation cost, performance, etc.

    Yes. And the "etc." includes "ease of programming".

    Weak model:
    Cheaper (and simpler) to implement;

    Yes.

    Performs better when there is no need to synchronize memory;

    Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm
    ahead in that case and achieve the same performance. It just has to
    keep checkpoints around in case that there is a need to synchronize
    memory.

    Performs worse when there is need to synchronize memory;

    With a cheap multiprocessor implementation, yes. In general, no: Any sequentially consistent implementation is also an implementation of
    every weaker memory model, and the memory barriers become nops in that
    kind of implementation. Ok, nops still have a cost, but it's very
    close to 0 on a modern CPU.

    Another potential performance disadvantage of sequential consistency
    even with a sophisticated implementation:

    If you have some algorithm that actually works correctly even when it
    gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency coming from
    making the load non-stale while that latency will not occur or be less
    in a similarly-sophisticated implementation of an appropriate weak
    consistency model.

    However, given that the access to actually-shared memory is slow even
    on weakly-consistent hardware, software usually takes measures to
    avoid having a lot of such accesses, so that cost will usually be
    miniscule.


    What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

    * For correctness, the safe way is to insert a memory barrier between
    any two memory operations.

    * For performance (on cheap implementations of weak memory models) you
    want to execute as few memory barriers as possible.

    * You cannot use testing to find out whether you have enough (and the
    right) memory barriers. That's not only because the involved
    threads may not be in the right state during testing for uncovering
    the incorrectness, but also because the hardware used for testing
    may actually have stronger consistency than the memory model, and so
    some kinds of bugs will never show up in testing on that hardware,
    even when the threads reach the right state. And testing is still
    the go-to solution for software people to find errors (nowadays even
    glorified by continuous integration and modern fuzz testing
    approaches).

    The result is that a lot of software dealing with shared memory is
    incorrect because it does not have a memory barrier that it should
    have, or inefficient on cheap hardware with expensive memory barriers
    because it uses more memory barriers than necessary for the memory
    model. A program may even be incorrect in one place and have
    superfluous memory barriers in another one.
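
    A standard illustration of the testing problem (a sketch, not from the
    thread): the missing release/acquire pair below is invisible when testing
    on TSO hardware, but can break on a weaker model.

        #include <atomic>

        int payload;                        // plain data being handed over
        std::atomic<bool> ready{false};

        void producer() {
            payload = 42;
            // Bug on weak models: nothing keeps the payload store ahead of
            // the flag store.  Tests on x86/TSO will typically pass anyway.
            ready.store(true, std::memory_order_relaxed);
            // Correct: std::memory_order_release
        }

        void consumer() {
            while (!ready.load(std::memory_order_relaxed)) { }
            // Correct: std::memory_order_acquire on the load above
            int v = payload;                // may read stale data on ARM/POWER
            (void)v;
        }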

    Or programmers just don't do this stuff at all (as advocated by
    jseigh), and instead just write sequential programs, or use bottled
    solutions that often are a lot more expensive than superfluous memory
    barriers. E.g., in Gforth the primary inter-thread communication
    mechanism is currently implemented with pipes, involving the system
    calls read() and write(). And Bernd Paysan who implemented that is a
    really good programmer; I am sure he would be able to wrap his head
    around the whole memory model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend
    on more productive things.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 16 08:58:40 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?
    ...
    Perhaps one might ask Dr. Kessler?

    https://acg.cis.upenn.edu/milom/cis501-Fall09/papers/Alpha21264.pdf

    I don't think that anything in the 21264 core would result in the
    Alpha-unique inconsistency; the only core mechanisms that I can think
    of where that would be relevant is value prediction, and the 21264
    does not do that.

    Looking at the memory subsystems of bigger Alpha systems might be more relevant.

    There is a good reason to suspect that the Alpha architects imagined
    hardware that actually did not appear: They did not specify hardware
    byte and 16-bit memory accesses with the justification that a
    first-level write-back cache would require ECC in DEC machines, and
    ECC for bytes (or read-modify-write for keeping ECC on larger units)
    is supposedly too expensive. However, the Alphas without BWX
    instructions (everything up to EV5, but EV56 and later acquired BWX)
    never had a first-level write-back cache.

    And the EV6 which has a first-level write-back cache, implements the
    BWX instructions, so the reasoning against BWX obviously does not hold
    water. Reading on page 31 of the paper above, the 21264 (EV6) uses read-modify-write for updating the ECC data.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 13:21:48 2024
    From Newsgroup: comp.arch

    On 11/15/2024 12:42 PM, Chris M. Thomasson wrote:
    On 11/15/2024 5:24 AM, Michael S wrote:
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?  I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>.  Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.


    I guess not. Shit happens. ;^o

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.
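
    For reference, a sketch of the kind of consume-ordered traversal being
    described (C++); on most machines the loads below compile to plain loads,
    while Alpha would need its mb:

        #include <atomic>

        struct Node {
            int value;
            std::atomic<Node*> next{nullptr};
        };

        std::atomic<Node*> head{nullptr};

        // Read-side traversal: each pointer load carries a data dependency
        // into the dereference, so no full acquire barrier is required.
        int sum_read_side() {
            int sum = 0;
            for (Node *n = head.load(std::memory_order_consume); n != nullptr;
                 n = n->next.load(std::memory_order_consume))
                sum += n->value;
            return sum;
        }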
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 13:23:41 2024
    From Newsgroup: comp.arch

    On 11/16/2024 1:21 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:42 PM, Chris M. Thomasson wrote:
    On 11/15/2024 5:24 AM, Michael S wrote:
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?  I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>.  Seriously? >>>>> Sequential consistency can be specified in one sentence: "The result >>>>> of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for a RCU based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.


    I guess not. Shit happens. ;^o

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    I think, iirc, there is a way to use an acquire membar on the loading of
    the initial node of a collection to iterate it without using memory_order_consume for every node. I might be wrong on that. It's been
    a while!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 13:28:06 2024
    From Newsgroup: comp.arch

    On 11/15/2024 10:39 PM, BGB wrote:
    [...]
    Hybrid:
      Try to have a fast-path in userland ("local core only" mutexes);
      Fall back to syscalls if not fast-path.

    Are you familiar with adaptive mutex logic? I know a lot about
    fast-paths in userland before they need to wait in the kernel on a
    contended mutex, or empty condition for a lock-free stack or something...
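
    For what it's worth, the usual shape of an adaptive mutex is something
    like this (a sketch; sys_wait/sys_wake stand in for a futex-like
    facility):

        #include <atomic>

        void sys_wait(std::atomic<int> *addr, int val);   // hypothetical
        void sys_wake(std::atomic<int> *addr);            // hypothetical

        struct AdaptiveMutex {
            std::atomic<int> state{0};     // 0 = free, 1 = locked

            void lock() {
                // Adaptive part: spin briefly in userland first, on the
                // assumption the holder is running and will release soon.
                for (int i = 0; i < 100; ++i) {
                    int expected = 0;
                    if (state.compare_exchange_weak(expected, 1,
                                                    std::memory_order_acquire))
                        return;
                }
                // Still contended: block in the kernel instead of burning CPU.
                int expected = 0;
                while (!state.compare_exchange_weak(expected, 1,
                                                    std::memory_order_acquire)) {
                    sys_wait(&state, 1);
                    expected = 0;
                }
            }

            void unlock() {
                state.store(0, std::memory_order_release);
                sys_wake(&state);   // naive: always wake; real ones track waiters
            }
        };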


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sat Nov 16 21:59:01 2024
    From Newsgroup: comp.arch

    On Fri, 15 Nov 2024 07:25:12 GMT, Anton Ertl wrote:

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo}, title =
    {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    ...

    Available online at Bitsavers <http://bitsavers.trailing-edge.com/pdf/dec/tech_reports/WRL-95-7.pdf>
    and mirrors.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Nov 16 14:10:15 2024
    From Newsgroup: comp.arch

    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for sequential consistency than for a weakly-consistent memory model
    (e.g., Alphas memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 16 22:28:12 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Yeah, I have absolutely no issue with ideas that are only obvious in hindsight, they deserve praise. My real problem are those things that
    are new, but only because of the environment, as in the idea would be obvious to anyone "versed in the field".

    "skilled in the art" (which has a nice ring to it) is the technical term
    in English (see
    https://www.epo.org/en/legal/guidelines-epc/2024/g_vii_3.html ).

    But presence or lack of an inventive step can be quite difficult to judge.
    Examiners have argued that "This solution is so simple, somebody
    must have come across it before"...

    In one particular case, a colleague and myself were cooperating
    closely on a related group of inventions. It was quite amusing
    that one particular point was quite obvious to him, which I found
    out when I told him about my "new" finding. Even more amusing
    was that the same thing happened vice versa - something that was
    completely obvious to me wasn't obvious to him at all, and needed
    an explanation.

    I.e US vs Norwegian (European?) patent law.

    I think the US has now gotten closer in patent law to what the
    rest of the world is doing.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Sat Nov 16 17:28:21 2024
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    jseigh <jseigh_es00@xemaps.com> writes:

    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    It isn't a claim, just an opinion.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Sun Nov 17 09:03:06 2024
    From Newsgroup: comp.arch

    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    the same as memory_order_acquire.

    Which brings up an interesting point. Even if the hardware
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model were in
    effect. Or maybe disable reordering or optimization altogether
    for those target architectures.
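
    A small example of that point (a sketch): even if the hardware kept every
    store in order, the compiler is free to reorder the plain version, so the
    ordering still has to be stated in the source.

        #include <atomic>

        int data;
        bool flag_plain;                      // ordinary variable
        std::atomic<bool> flag_atomic{false};

        void publish_broken() {
            data = 1;
            flag_plain = true;                // compiler may move this above 'data = 1'
        }

        void publish_ok() {
            data = 1;
            flag_atomic.store(true, std::memory_order_release);  // compiler must not
        }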

    Joe Seigh

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 17 15:15:08 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alphas memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 17 15:17:52 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Nov 17 13:30:10 2024
    From Newsgroup: comp.arch

    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction.
    Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Strange! C++ has:

    https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence

    That only deals with compilers, not the arch memory order... Humm...

    Interesting Joe!
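
    A sketch of what atomic_signal_fence is for: ordering against a signal
    handler on the same thread, where only the compiler needs restraining and
    no fence instruction should be emitted (names below are just for
    illustration).

        #include <atomic>

        int detail;                              // plain data for the handler
        std::atomic<bool> posted{false};

        void before_raising_signal() {
            detail = 42;
            // Compiler-only barrier: keeps the 'detail' store above the flag
            // store, but compiles to no hardware fence.
            std::atomic_signal_fence(std::memory_order_release);
            posted.store(true, std::memory_order_relaxed);
        }

        extern "C" void on_signal(int) {         // runs on the same thread
            if (posted.load(std::memory_order_relaxed)) {
                std::atomic_signal_fence(std::memory_order_acquire);
                // 'detail' can be read safely here.
            }
        }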


    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.   Or maybe disable reordering or optimization altogether
    for those target architectures.

    Indeed. The compiler needs to know about these things. Iirc, there was
    an old post over c.p.t that deals with a compiler (think it was GCC)
    that messed up a pthread try lock for a mutex. It's a very old post. But
    I remember it for sure.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Nov 17 15:34:08 2024
    From Newsgroup: comp.arch

    On 11/17/2024 7:15 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say, solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alphas memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    I am trying to say you might not be hired if you only knew how to handle std::memory_order_seq_cst wrt C++... ?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 18 07:11:04 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat. ...
    I am trying to say you might not be hired if you only knew how to handle std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Mon Nov 18 11:56:48 2024
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    same as memory_order_acquire.

    It's back in C++20. I think the problem wasn't so much implementing
    it, which as you say can be trivially done by aliasing with acquire,
    but specifying it. We use load dependency ordering in Java on AArch64
    to satisfy some memory model requirements, so it's not as if it's
    difficult to use.

    Which brings up an interesting point. Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    Yes, exactly. It's not as if this is an issue that affects people who
    program in high level languages, it's about what language implementers
    choose to do.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Mon Nov 18 12:03:55 2024
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?

    I don't know. Given the contortions that the Linux kernel people had
    to go through, maybe it really was present in hardware.

    As a programming language implementer, I don't much think about "Will
    the hardware really do this?" because new hardware arises all the
    time, and I don't want users' programs to stop working.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 15:20:57 2024
    From Newsgroup: comp.arch

    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.
    ...
    I am trying to say you might not be hired if you only knew how to handle
    std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    Well, if you used std::memory_order_seq_cst to implement, say, a mutex
    and/or spinlock's memory barrier logic, that would raise a red flag
    in my mind... Not good.
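
    What one would normally expect instead (a sketch): acquire on the lock
    side, release on the unlock side, and no seq_cst anywhere.

        #include <atomic>

        struct Spinlock {
            std::atomic<bool> locked{false};

            void lock() {
                // Acquire: accesses in the critical section cannot move
                // above this successful exchange.
                while (locked.exchange(true, std::memory_order_acquire)) {
                    // spin; a real one would pause / back off here
                }
            }

            void unlock() {
                // Release: stores in the critical section become visible
                // before the lock is seen as free.
                locked.store(false, std::memory_order_release);
            }
        };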
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 15:34:03 2024
    From Newsgroup: comp.arch

    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.
    ...
    I am trying to say you might not be hired if you only knew how to handle std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on
    multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above?  And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above.  Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point.  Von Neumann argued for fixed point as follows
    <https://booksite.elsevier.com/9780124077263/downloads/
    historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    well, if you used std::memory_order_seq_cst to implement, say, a mutex and/or spinlock memory barrier logic, well, that would raise a red flag
    in my mind... Not good.

    Don't tell me you want all of std::memory_order_* to default to std::memory_order_seq_cst? If you're on a system that only has seq_cst and nothing else, okay, but not on other weaker (memory order) systems, right?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 19:37:50 2024
    From Newsgroup: comp.arch

    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Nov 18 19:38:41 2024
    From Newsgroup: comp.arch

    On 11/17/2024 1:30 PM, Chris M. Thomasson wrote:
    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a
    node based stack of something in RCU. In most systems it only acts
    like a compiler barrier. On the Alpha, it must emit a membar
    instruction. Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Strange! C++ has:

    https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence

    That only deals with compilers, not the arch memory order... Humm...

    Interesting Joe!


    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.   Or maybe disable reordering or optimization altogether
    for those target architectures.

    Indeed. The compiler needs to know about these things. Iirc, there was
    an old post over c.p.t that deals with a compiler (think it was GCC)
    that messed up a pthread try lock for a mutex. It's a very old post. But
    I remember it for sure.


    a song for contention:

    https://youtu.be/Sdq4T3iRV80?list=RDMMy3hf0T4qpYg

    ;^D
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 20 14:03:29 2024
    From Newsgroup: comp.arch

    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction.
    Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.



    Or maybe disable reordering or optimization altogether
    for those target architectures.
    ^^^^^^^^^^^^

    Yeah. No shit. Some people say the std::memory_order_* stuff is too complex.
    Others say it's okay. Shit happens.

    I remember way before C++11 I would feel nervous about keeping link-time
    optimizations on because I thought they might mess around with my custom
    assembly language code for my sensitive thread sync algorithms. I
    thought that a compiler would say, okay, this is calling into externally
    assembled/compiled code, no optimizations... Well, link-time
    optimization scared me.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Nov 20 14:08:11 2024
    From Newsgroup: comp.arch

    On 11/17/2024 6:03 AM, jseigh wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction.
    Iirc, mb for alpha? Cannot remember that one right now.

    That got deprecated.  Too hard for compilers to deal with.  It's now
    same as memory_order_acquire.

    Horrible! So on the SPARC in RMO mode, if I used
    std::memory_order_consume to traverse a linked data structure in RCU, it
    would use a god damn #LoadStore | #LoadLoad for every node load? If so,
    YIKES! Might as well use a single acquire after loading the head of the list.
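
    The alternative being described, as a sketch; this relies on nodes only
    being published by release operations on the head, as in a push-only
    stack, so one acquire at the head covers the whole traversal:

        #include <atomic>

        struct Node { int value; Node *next; };  // next written before publish
        std::atomic<Node*> head{nullptr};

        int sum_with_single_acquire() {
            // One acquire (one membar on SPARC RMO) at the head...
            Node *n = head.load(std::memory_order_acquire);
            int sum = 0;
            for (; n != nullptr; n = n->next)    // ...then plain loads
                sum += n->value;
            return sum;
        }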




    Which brings up an interesting point.  Even if the hardware memory
    memory model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.   Or maybe disable reordering or optimization altogether
    for those target architectures.

    Joe Seigh


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Nov 21 05:46:56 2024
    From Newsgroup: comp.arch

    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Andrew.

    Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    My text for Pick Basic dependency is a quote (where I label the lines
    1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.

    That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    Kent
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From aph@aph@littlepinkcloud.invalid to comp.arch on Thu Nov 21 17:41:33 2024
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> wrote:
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>, <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect

    You're right, they do seem to have forgotten to define Explicit Memory
    Read effect. I'm sure they meant to.

    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    Err, the previous version of the same document. :-)

    My text for Pick Basic dependency is a quote (where I label the lines
    1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.

    That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?

    B2-174 in DDI0487J

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    So it seems. I think everything in DDI0487J was meant to be there in
    DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons. I believe the author of the
    earlier, easier-to-read version of the Memory Model left Arm for
    another company. If it's any consolation, the version of the MM before
    he rewrote it was absolutely incomprehensible.

    Andrew.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Nov 22 15:45:20 2024
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid writes:
    Kent Dickey <kegs@provalid.com> wrote:
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    That sort of "summary" was exactly what I was asking for, but I don't see it,
    so can you please name the page?

    B2-174 in DDI0487J

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    So it seems. I think everything in DDI0487J was meant to be there in DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons.

    Between DDI0487G and DDI0487H, they completely rewrote the ARM
    using a requirements-based description rather than the straightforward
    prose in prior editions.

    They've been wordsmithing it in every subsequent version.

    I consider the prose version to be more readable, myself.


    --- Synchronet 3.20a-Linux NewsLink 1.114