• MM instruction and the pipeline

    From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Oct 15 22:56:34 2024
    From Newsgroup: comp.arch

    Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread.

    My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete, I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    2. Can a load or store where the memory address is in neither the source
    nor the destination of the MM proceed simultaneously with the MM?

    3. Can a load where the memory address is within the source of the MM proceed?

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 19:26:46 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the source nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher-end implementations--after checking for no conflict
    {and this depends on the access going to DRAM, not MMIO or config spaces}

    3. Can a load where the memory address is within the source of the MM proceed?

    It is just read data, so, yes--at least theoretically.
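
    A rough C sketch of the kind of no-conflict check being described: a
    younger access is compared against the MM's source and destination
    ranges and may go ahead if it misses both (question 2) or, for a load,
    touches only the source (question 3). The names and the policy split
    are purely illustrative, not My 66000's actual mechanism.

      #include <stdbool.h>
      #include <stdint.h>

      /* Does [a, a+alen) overlap [b, b+blen)?  Half-open byte ranges. */
      static bool ranges_overlap(uint64_t a, uint64_t alen,
                                 uint64_t b, uint64_t blen)
      {
          return a < b + blen && b < a + alen;
      }

      /* src/dst/len describe an in-flight MM; addr/alen/is_store describe a
         younger access.  Purely illustrative policy. */
      static bool may_proceed(uint64_t src, uint64_t dst, uint64_t len,
                              uint64_t addr, uint64_t alen, bool is_store)
      {
          bool hits_src = ranges_overlap(addr, alen, src, len);
          bool hits_dst = ranges_overlap(addr, alen, dst, len);

          if (!hits_src && !hits_dst) return true;              /* question 2 */
          if (hits_src && !hits_dst && !is_store) return true;  /* question 3 */
          return false;              /* would need progress tracking (Q4-Q6) */
      }

      int main(void)
      {
          /* MM copying 4 KiB from 0x1000 to 0x9000; a load at 0x5000 misses both. */
          return may_proceed(0x1000, 0x9000, 0x1000, 0x5000, 8, false) ? 0 : 1;
      }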

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is still waiting to be MMed. A low-end
    implementation is unlikely to have such a mechanism; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be), so there
    may not be a lot of AGEN capability or cache access port availability.
    So, the faster one makes MM (and by extension MS), the less one needs
    of overlap and pipelining.
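
    A sketch of one way to track "what has already been MMed", assuming,
    purely for illustration, an ascending copy order with a single progress
    cursor; a real implementation need not copy in that order, and all the
    names here are hypothetical.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Hypothetical in-flight MM state: bytes of the move below 'cursor'
         have already been copied; bytes at or above it have not. */
      struct mm_state {
          uint64_t src, dst, len;
          uint64_t cursor;        /* 0 .. len, advances as the MM proceeds */
      };

      /* Questions 4/6: an access to the destination sees copied data only
         if it lies entirely below the cursor. */
      static bool dst_access_in_completed_region(const struct mm_state *mm,
                                                 uint64_t addr, uint64_t alen)
      {
          return addr >= mm->dst && addr + alen <= mm->dst + mm->cursor;
      }

      /* Question 5: a store into the source is safe only if the MM has
         already read past it (again assuming ascending order). */
      static bool src_store_already_consumed(const struct mm_state *mm,
                                             uint64_t addr, uint64_t alen)
      {
          return addr >= mm->src && addr + alen <= mm->src + mm->cursor;
      }

      int main(void)
      {
          struct mm_state mm = { 0x1000, 0x9000, 0x3000, 0x1000 }; /* 1/3 done */
          printf("load of first third of dst ok:  %d\n",
                 dst_access_in_completed_region(&mm, 0x9000, 8));
          printf("store to first third of src ok: %d\n",
                 src_store_already_consumed(&mm, 0x1000, 8));
          return 0;
      }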


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Oct 16 16:48:39 2024
    From Newsgroup: comp.arch

    On 10/16/24 1:56 AM, Stephen Fuld wrote:
    Even though this is about the MM instruction, and the MM
    instruction is mentioned in other threads, they have lots of other
    stuff (thread drift), and this isn't related to C, standard or
    otherwise, so I thought it best to start a new thread,

    My questions are about what happens to subsequent instructions
    that immediately follow the MM in the stream when an MM
    instruction is executing.  Since an MM instruction may take quite
    a long time (in computer time) to complete I think it is useful to
    know what else can happen while the MM is executing.

    This would seem to be very implementation dependent.
    Architecturally, no following instructions can execute until after
    the MM completes. With respect to microarchitecture, an arbitrary
    amount of parallelism could be provided.


    I will phrase this as a series of questions.

    While Mitch Alsup can answer these more authoritatively, I will
    take a stab at them.

    1.    I assume that subsequent non-memory reference instructions
    can proceed simultaneously with the MM.  Is that correct?

    This would probably be true even for the in-order scalar
    implementation.
    2.    Can a load or store where the memory address is in neither
    the source nor the destination of the MM proceed simultaneously
    with the MM

    This is a little more complicated than just marking a register as
    not-ready (for a load destination), so might not be supported in
    a simple implementation. Memory accesses would have to check both
    ranges rather than just one of 32 register names or eight store
    buffer entries.

    Mitch Alsup's description of the small quasi-scalar core implies
    to me that the MM instruction would occupy the memory access
    interface until it is finished.

    I would guess that any out-of-order implementation would support
    loads and stores outside of the MM regions to proceed
    speculatively until the various OoO buffering structures are
    filled.

    3.    Can a load where the memory address is within the source of
    the MM proceed?

    My guess would be that any OoO implementation would support this.
    If the implementation checks for a hit in both ranges, it would
    seem to be little extra effort to allow a load to a 'clean'
    address to proceed.

    Supporting this and preventing reads of the destination and all
    stores would only require one address range check; loads can
    proceed as long as they are not within the destination.

    For the next questions, assume for exposition that the MM has
    proceeded to complete 1/3 of the move when the following
    instructions come up.

    4.    Can a load in the first third of the destination range proceed?

    I would guess that an out-of-order implementation would forward
    data from all stores performed speculatively by the MM (limited by
    the store queue). MM stores that are no longer speculative — where
    an interrupt would place the count — would seem to be naturally
    handled as if singular committed stores, i.e., following
    instructions could speculatively execute using those values.
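
    A rough sketch of that kind of forwarding: a younger load searches the
    store queue youngest-first and takes its data from the newest fully
    covering entry. Structure, sizes, and the refusal of partial overlaps
    are illustrative simplifications, not any real design.

      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      #define SQ_ENTRIES 16

      struct sq_entry { uint64_t addr; uint32_t len; uint8_t data[64]; bool valid; };

      struct store_queue {
          struct sq_entry e[SQ_ENTRIES];
          unsigned head;          /* index of youngest entry + 1 (mod size) */
      };

      /* Forward to a load of 'len' bytes at 'addr': search youngest-first and
         copy from the newest store that fully covers the load.  Partial
         overlap is simply refused here; a real design would stall or merge. */
      static bool sq_forward(const struct store_queue *sq,
                             uint64_t addr, uint32_t len, uint8_t *out)
      {
          for (unsigned i = 0; i < SQ_ENTRIES; i++) {
              unsigned idx = (sq->head + SQ_ENTRIES - 1 - i) % SQ_ENTRIES;
              const struct sq_entry *s = &sq->e[idx];
              if (!s->valid) continue;
              if (!(addr < s->addr + s->len && s->addr < addr + len))
                  continue;                         /* no overlap: keep looking */
              if (addr >= s->addr && addr + len <= s->addr + s->len) {
                  memcpy(out, s->data + (addr - s->addr), len);
                  return true;                      /* forwarded */
              }
              return false;                         /* partial overlap */
          }
          return false;              /* nothing younger: read the cache/memory */
      }

      int main(void)
      {
          struct store_queue sq = {0};
          sq.e[0] = (struct sq_entry){ .addr = 0x9000, .len = 64, .valid = true };
          sq.head = 1;                /* one speculative MM store in the queue */
          uint8_t buf[8];
          return sq_forward(&sq, 0x9008, 8, buf) ? 0 : 1;
      }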

    5.    Can a store in the first third of the source range proceed?

    In the non-speculative region of the MM, speculative stores could
    "execute", storing to the store queue. These stores would be
    squashed if the MM does not fully complete along with all other
    instructions after the MM. The MM is synchronous.

    A large MM that is no longer speculative might be implemented as
    avoiding the store queue to allow more stores after the MM to be
    speculated. For very large MMs, a copy engine farther from the
    core might be used.

    6.    Can a store in the first third of the destination range
    proceed?

    Since the MM has architecturally completed to roughly that point
    (some stores might only have "completed" to the store queue), it
    would not be difficult to support speculative stores in the
    completed range for an out-of-order implementation. These stores
    would be rolled back if the MM does not fully complete and commit.


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    For page-aligned copies, a copy-on-write mechanism might be used.

    There are also cache designs which support deduplication; cache
    block aligned copies might be faster than physical copying. With lossy/truncated cache compression, unaligned fragments might be
    deduplicated (and read-for-ownership might be avoided similar to
    having fine-grained valid bits).

    I rather suspect that what is physically possible is far broader
    than what is possible with a finite engineering budget.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 21:14:37 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    the first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, yes, I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    In practice, one is not going to allow MM to get farther than
    the miss buffer ahead of a mispredict shadow.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Oct 17 08:49:08 2024
    From Newsgroup: comp.arch

    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing.  Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1.    I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM.  Is that correct?

    Yes, they may begin but they cannot retire.

    2.    Can a load or store where the memory address is in neither the
    source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3.    Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4.    Can a load in the first third of the destination range proceed?

    5.    Can a store in the first third of the source range proceed?

    6.    Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is still waiting to be MMed. A low-end
    implementation is unlikely to have such a mechanism; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far fewer resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes into just L3, or even a dedicated very small buffer or two.
    Furthermore, for the transfers after the first, unless there is a page
    crossing, why go through a full AGEN, when a simple add to the previous
    address is all that is required, thus freeing AGEN resources.
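
    The split being suggested, sketched in plain C: peel off the ragged head
    and tail and handle the cache-line-aligned middle as a separate bulk
    path (which, in hardware, could bypass the L1 and go to L3 or a small
    dedicated buffer; here it is just another memcpy). The line size and
    names are illustrative.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define LINE 64u                /* illustrative cache-line size */

      /* Copy n bytes, separating the line-aligned middle (which a hardware MM
         might stream around the L1) from the ragged head and tail.  Alignment
         is taken relative to the destination only. */
      static void mm_split_copy(uint8_t *dst, const uint8_t *src, size_t n)
      {
          size_t head = (LINE - ((uintptr_t)dst & (LINE - 1))) & (LINE - 1);
          if (head > n) head = n;
          memcpy(dst, src, head);                         /* ragged head  */

          size_t remaining = n - head;
          size_t middle = remaining & ~(size_t)(LINE - 1);
          memcpy(dst + head, src + head, middle);         /* bulk path    */

          memcpy(dst + head + middle, src + head + middle,
                 remaining - middle);                     /* ragged tail  */
      }

      int main(void)
      {
          uint8_t a[300], b[300];
          for (int i = 0; i < 300; i++) a[i] = (uint8_t)i;
          mm_split_copy(b + 3, a + 5, 290);
          return memcmp(b + 3, a + 5, 290);               /* 0 on success */
      }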


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly, for
    the time it takes for a single load to take place, in order to allow
    the non-memory-reference instructions following that load to execute
    overlapped with the completion of the MM. Needs trace analysis.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 17 13:16:24 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the
    source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3. Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is still waiting to be MMed. A low-end
    implementation is unlikely to have such a mechanism; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far fewer resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes into just L3, or even a dedicated very small buffer or two.
    Furthermore, for the transfers after the first, unless there is a page
    crossing, why go through a full AGEN, when a simple add to the previous
    address is all that is required, thus freeing AGEN resources.


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly, for
    the time it takes for a single load to take place, in order to allow
    the non-memory-reference instructions following that load to execute
    overlapped with the completion of the MM. Needs trace analysis.

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after
    it should be no different, except that the exact order of the individual
    MM bytes moved is not defined beyond being overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. under TSO, MM being a sequence of stores cannot start until it is at
    the end of the Load Store Queue and ready to retire. So all older LD and
    ST must have retired, and we only need to consider interactions of MM with
    younger LD and ST.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then the LSQ must retain the LD [x] someplace so it can
    check whether, at some later time, perhaps far later, a MM store to the
    physical address of &dst overlaps the physical address of [x].
    If it does overlap then the LSQ must trigger a replay of the younger LD [x]
    so it picks up the new value from the dst buffer.

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.
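
    A sketch of that bookkeeping: bypassed younger loads are remembered,
    and each MM store that later drains checks them for overlap and marks
    any hit for replay. Structure sizes and names are illustrative only.

      #include <stdbool.h>
      #include <stdint.h>

      #define MAX_BYPASSED 32

      /* A younger load that was allowed to execute ahead of an older MM. */
      struct bypassed_load {
          uint64_t paddr;
          uint32_t len;
          bool     valid;
          bool     needs_replay;
      };

      static struct bypassed_load lsq[MAX_BYPASSED];

      /* Remember a load that bypassed the MM (called when it executes early). */
      static void lsq_record_bypass(unsigned slot, uint64_t paddr, uint32_t len)
      {
          lsq[slot] = (struct bypassed_load){ paddr, len, true, false };
      }

      /* Called for each MM store as it actually writes [paddr, paddr+len):
         any remembered load that overlaps must be replayed so it picks up
         the new destination data. */
      static void lsq_check_mm_store(uint64_t paddr, uint32_t len)
      {
          for (unsigned i = 0; i < MAX_BYPASSED; i++) {
              if (!lsq[i].valid) continue;
              if (paddr < lsq[i].paddr + lsq[i].len &&
                  lsq[i].paddr < paddr + len)
                  lsq[i].needs_replay = true;
          }
      }

      int main(void)
      {
          lsq_record_bypass(0, 0x9000, 8);     /* LD [x] executed early       */
          lsq_check_mm_store(0x9000, 64);      /* MM later stores over it     */
          return lsq[0].needs_replay ? 0 : 1;  /* the load must be replayed   */
      }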


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 17 21:49:19 2024
    From Newsgroup: comp.arch

    On Thu, 17 Oct 2024 17:16:24 +0000, EricP wrote:

    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after
    it should be no different, other than the exact order of individual MM
    bytes moved is not defined other than it is overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. Under TSO, MM being a sequence of stores cannot start until it is
    at the end of the Load Store Queue and ready to retire. So all older LD
    and
    ST must have retired and we only need consider interactions of MM with younger LD and ST.

    My 66000 is NOT TSO; it is causal unless special areas are being
    accessed.

    But, yes, MM must operate as if it is ordered with other memory
    reference
    instructions.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then LSQ must retain the LD [x] someplace so it can
    check if at some later time, perhaps far later, that a MM store to the physical address of &dst overlaps the physical address of [x].
    If it does overlap then LSQ must trigger a replay of the younger LD [x]
    so it picks up the new value from dst buffer.

    No essential disagreement.

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.

    Once again it is not TSO.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sun Oct 20 11:57:59 2024
    From Newsgroup: comp.arch

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,
    possibly crossing page boundaries if associated with a prefetcher
    that crosses page boundaries and so does address translation).
    A stride based prefetcher would have the address generation
    capability to process the streaming access of MM.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    In practice, one is not going to allow MM to get farther than
    the miss buffer ahead of a mispredict shadow.

    Once the MM itself is non-speculative (i.e., branches/exceptions
    in the path to it have all resolved), it seems an MM could
    progress as far as permissions have been confirmed.

    With its synchronous interface, it seems that for sufficiently
    large MM operations one might want to context switch the MM off of
    a high performance core and onto a MM engine so that the core
    could be used for other work.

    Given the cache misses (including predictors) from a context
    switch, this might not be worthwhile even if the MM engine was
    substantially more energy efficient.

    Forcing a thread to pause while a "large" MM operation is done
    _feels_ wrong.

    I guess one could architect a fork+terminate operation that
    would allow a non-speculative but physically incomplete copy to
    commit and then "end" that thread while continuing in a "new"
    thread from the instruction after the fork. Such an interface
    seems clunky but might be more general than just having a
    quasi-synchronous MM with similar behavior.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 21 00:25:57 2024
    From Newsgroup: comp.arch

    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be trapped or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1, you end up not having a TLB to
    translate page-crossing addresses.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Oct 20 20:30:45 2024
    From Newsgroup: comp.arch

    On 10/20/2024 5:25 PM, MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be detected or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1; you end up not having a TLB to
    translate page crossing addresses.

    Yes, but once you translate the starting address of a page, which you
    can tell from the low-order X bits (X depends on page size) being
    zero, you don't need to translate again until the next page crossing.
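
    A sketch of that incremental address generation, with a made-up
    translate() standing in for the TLB: translation is repeated only when
    the low-order bits wrap to zero, i.e., at a page crossing.

      #include <stdint.h>
      #include <stdio.h>

      #define PAGE_SHIFT 12u                  /* 4 KiB pages for illustration */
      #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

      /* Made-up stand-in for a TLB lookup: virtual page -> physical page. */
      static uint64_t translate(uint64_t vaddr)
      {
          return (vaddr & ~(uint64_t)PAGE_MASK) + 0x100000000ull;
      }

      int main(void)
      {
          uint64_t va = 0x7ffe000ff8ull;      /* 8 bytes before a page end */
          uint64_t pa_page = translate(va);   /* one real translation      */
          int translations = 1;

          for (int i = 0; i < 4; i++) {       /* walk four 8-byte chunks   */
              uint64_t pa = pa_page + (va & PAGE_MASK);
              printf("va=%#llx -> pa=%#llx\n",
                     (unsigned long long)va, (unsigned long long)pa);
              va += 8;
              if ((va & PAGE_MASK) == 0) {    /* low-order bits hit zero   */
                  pa_page = translate(va);    /* only now translate again  */
                  translations++;
              }
          }
          printf("translations performed: %d\n", translations);
          return 0;
      }
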
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 21 06:32:52 2024
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not only has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory
    controller may be less capable, and leave scrubbing to software; maybe
    also refresh?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 21 12:56:59 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    A typical DDR5 row contains 4096 data bits. That's 32x bigger than the
    internal ECC block. In order to do a scrub at refresh without a major
    increase in refresh timing, one would need a lot more ECC correction
    logic than is currently present.
    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).
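
    The arithmetic, spelled out (assuming the DDR5 on-die ECC block covers
    128 data bits, which is what makes the 32x figure come out):

      #include <stdio.h>

      int main(void)
      {
          int row_bits       = 4096;  /* data bits in a typical DDR5 row */
          int ecc_block_bits = 128;   /* data bits per on-die ECC block (128+8 SEC) */

          /* Scrubbing a whole row during refresh would mean checking (and
             possibly correcting) this many ECC blocks within the refresh
             window, or adding that many correctors in parallel. */
          printf("ECC blocks per row: %d\n", row_bits / ecc_block_bits);  /* 32 */
          return 0;
      }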

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory controller may be less capable, and leave scrubbing to software; maybe
    also refresh?


    Hopefully the last part is tongue in cheek.

    - anton


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 22 13:08:38 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.
    But that particular error, no, it should not be required to be detected,
    as that would force an extra read cycle onto every quadword store.

    By logging I mean a FIFO buffer that error reports can be dumped into.
    This ensures that the fact that an error was detected is not lost.
    The buffer can be set to trigger a high-priority interrupt on a number of
    conditions, such as the FIFO being 3/4 full, specific kinds of errors,
    or a log entry having sat in the buffer for a while, say 1 sec.
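
    A minimal sketch of such a log FIFO; the field names, sizes, and the
    3/4-full threshold are just the examples from the text, and the
    high-priority interrupt is modeled as a flag.

      #include <stdbool.h>
      #include <stdint.h>

      #define LOG_ENTRIES 64

      struct err_report {
          uint64_t paddr;        /* failing address, to narrow down the FRU */
          uint8_t  severity;     /* e.g. 0 = corrected, 2 = uncorrectable   */
          uint64_t timestamp;
      };

      struct err_log {
          struct err_report e[LOG_ENTRIES];
          unsigned head, tail;   /* ring-buffer indices */
          bool     irq_pending;  /* "high-priority interrupt" modeled as a flag */
      };

      static unsigned log_count(const struct err_log *l)
      {
          return (l->head + LOG_ENTRIES - l->tail) % LOG_ENTRIES;
      }

      /* Dump a report into the FIFO; request the interrupt when 3/4 full or
         when the report is severe enough to warrant immediate attention. */
      static void log_error(struct err_log *l, struct err_report r)
      {
          if (log_count(l) == LOG_ENTRIES - 1)
              return;                         /* full: drop (or overwrite oldest) */
          l->e[l->head] = r;
          l->head = (l->head + 1) % LOG_ENTRIES;
          if (log_count(l) >= (LOG_ENTRIES * 3) / 4 || r.severity > 1)
              l->irq_pending = true;
      }

      int main(void)
      {
          struct err_log log = {0};
          log_error(&log, (struct err_report){ .paddr = 0x12345000, .severity = 2 });
          return log.irq_pending ? 0 : 1;     /* severe error raises the interrupt */
      }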

    Also, that error should not raise an error exception in an application,
    for any number of reasons: most writes are lazy and could complete
    long after the app that did the store has been switched out.

    But each error must be assessed on an individual basis.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 22 18:13:25 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether
    logging is useful or required.

    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log an error?

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/ram location information
    to at least narrow down to an FRU.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 22 18:16:25 2024
    From Newsgroup: comp.arch

    On Tue, 22 Oct 2024 17:08:38 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be
    logged. But that particular error, no it should not be required to
    be detected as that would force an extra read cycle onto every
    quadword store.

    This, too, is my interpretation--in a similar way that one can use
    bad-ECC to denote uninitialized data which "goes away" when the
    data gets written, Memory Moving good-data over bad-data makes
    the error "go away".

    By logging I mean a fifo buffer that error reports can be dumped into.

    As I specified, all data errors are available at execute time
    and can be properly trapped with precision.

    This ensures that the fact an error was detected is not lost.
    This can be set to trigger a high priority interrupt on a number of conditions such as fifo buffer 3/4 full, or specific kinds of errors,
    or when a log has been in the buffer a while, say 1 sec.

    Also that error should not raise an error exception in an application
    for any number of reasons, such as most writes are lazy and could be
    long after the app that did the store has been switched out.

    In My 66000 all written data is available prior to retirement, and
    can be detected/corrected before the store is retired.

    But each error must be assessed on an individual basis.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 22 18:21:04 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 9:56:59 +0000, Michael S wrote:

    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    Typical DDR5 row contains 4096 data bits. That's 32x bigger than
    internal ECC block. In order to do scrub at refresh without major
    increase in refresh timing one would need a lot more ECC correction
    logic than currently present.

    The standard 64+8 code is SECDED and takes 8 gate delays (5-XOR gates,
    2 NAND gates, 1 XOR gate).
    One can do as many 64+8 error checks as one wants within those gate
    delays.

    It is only when one wants better than SEC/DED that things get
    interesting.
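
    For reference, a self-contained sketch of a (72,64) extended-Hamming
    SECDED code: check bits at power-of-two positions plus an overall
    parity bit, enough to correct any single-bit flip and to detect (but
    not correct) any double-bit flip. This is the textbook construction,
    not any particular controller's 64+8 code.

      #include <stdint.h>
      #include <stdio.h>

      /* (72,64) extended-Hamming SECDED sketch.  Codeword bit positions 0..71:
         position 0 is the overall parity bit, power-of-two positions
         (1,2,4,8,16,32,64) are the Hamming check bits, and the remaining 64
         positions carry the data. */

      static void encode(uint64_t data, uint8_t cw[72])
      {
          int d = 0;
          for (int pos = 1; pos < 72; pos++)
              cw[pos] = ((pos & (pos - 1)) == 0) ? 0
                                                 : (uint8_t)((data >> d++) & 1);

          for (int p = 1; p < 72; p <<= 1) {           /* Hamming check bits */
              uint8_t parity = 0;
              for (int pos = 1; pos < 72; pos++)
                  if ((pos & p) && pos != p)
                      parity ^= cw[pos];
              cw[p] = parity;
          }
          cw[0] = 0;                                   /* overall even parity */
          for (int pos = 1; pos < 72; pos++)
              cw[0] ^= cw[pos];
      }

      /* Returns 0 = clean, 1 = single-bit error corrected, 2 = uncorrectable. */
      static int decode(uint8_t cw[72])
      {
          int syndrome = 0;
          uint8_t overall = 0;
          for (int pos = 0; pos < 72; pos++) {
              overall ^= cw[pos];
              if (pos >= 1 && cw[pos])
                  syndrome ^= pos;
          }
          if (syndrome == 0 && overall == 0) return 0; /* no error            */
          if (overall == 1) {                          /* odd number of flips */
              if (syndrome != 0) cw[syndrome] ^= 1;    /* data or check bit   */
              else               cw[0] ^= 1;           /* overall parity bit  */
              return 1;                                /* corrected (SEC)     */
          }
          return 2;                /* even flips, syndrome != 0: DED, trap it */
      }

      int main(void)
      {
          uint8_t cw[72];
          encode(0x0123456789abcdefULL, cw);

          cw[13] ^= 1;                                 /* single-bit upset */
          printf("single flip -> %d (1 = corrected)\n", decode(cw));

          cw[13] ^= 1; cw[40] ^= 1;                    /* double-bit upset */
          printf("double flip -> %d (2 = uncorrectable)\n", decode(cw));
          return 0;
      }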

    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).

    DRAM clock should not be that much slower than CPU clock -- a factor
    of 2-3 is expected (logic delay), putting DRAM clock in the 0.75ns
    range.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 22 14:45:09 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.
    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???
    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether logging is useful or required.

    The scenario given was a store that overwrites the erroneous data completely.
    Assuming an 8+64 SECDED ECC, no, it should not be required to be detected,
    and therefore not logged or raised, as that would force an unnecessary
    read cycle on every quadword memory store just to check for something
    that very rarely occurs.

    Compare that to, say, a byte memory store, which is an RMW cycle.
    If the read portion detects an error then it should be logged,
    but again not raised as an error, because stores are mostly asynchronous
    and could complete long after the store instruction.


    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log and error?

    I'm drawing a distinction between logging, which is asynchronous,
    and raising some kind of exception, which requires the detection
    be synchronous with the code.

    But even if an error is synchronous, if it is transient *and corrected*,
    whether a memory or bus error, or even an internal register parity error
    (if it had such checks), I want it logged but not elevated to an exception.

    A speculative read is the same case as a scrubber read detecting an error -
    log it, but no exception, because it is asynchronous to the code.

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/ram location information
    to at least narrow down to an FRU.

    Yes, then software might take the physical addresses from the disk log
    and backtrack to the physical page frames affected, and if not transient
    then mark those pages to be retired the next time they get recycled.

    Oh, and a pop-up "You have memory errors"



    --- Synchronet 3.20a-Linux NewsLink 1.114