• MM instruction and the pipeline

    From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Oct 15 22:56:34 2024
    From Newsgroup: comp.arch

    Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread.

    My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete, I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    2. Can a load or store where the memory address is in neither the source
    nor the destination of the MM proceed simultaneously with the MM?

    3. Can a load where the memory address is within the source of the MM proceed?

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 19:26:46 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the source nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher-end implementations--after checking for no conflict
    {and this depends on the access going to DRAM, not MMIO or config spaces}

    3. Can a load where the memory address is within the source of the MM proceed?

    It is just read data, so, yes--at least theoretically.
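
    A rough C sketch of the kind of no-conflict check being described: a
    younger access is compared against the MM's source and destination
    ranges and may go ahead if it misses both (question 2) or, for a load,
    touches only the source (question 3). The names and the policy split
    are purely illustrative, not My 66000's actual mechanism.

      #include <stdbool.h>
      #include <stdint.h>

      /* Does [a, a+alen) overlap [b, b+blen)?  Half-open byte ranges. */
      static bool ranges_overlap(uint64_t a, uint64_t alen,
                                 uint64_t b, uint64_t blen)
      {
          return a < b + blen && b < a + alen;
      }

      /* src/dst/len describe an in-flight MM; addr/alen/is_store describe a
         younger access.  Purely illustrative policy. */
      static bool may_proceed(uint64_t src, uint64_t dst, uint64_t len,
                              uint64_t addr, uint64_t alen, bool is_store)
      {
          bool hits_src = ranges_overlap(addr, alen, src, len);
          bool hits_dst = ranges_overlap(addr, alen, dst, len);

          if (!hits_src && !hits_dst) return true;              /* question 2 */
          if (hits_src && !hits_dst && !is_store) return true;  /* question 3 */
          return false;              /* would need progress tracking (Q4-Q6) */
      }

      int main(void)
      {
          /* MM copying 4 KiB from 0x1000 to 0x9000; a load at 0x5000 misses both. */
          return may_proceed(0x1000, 0x9000, 0x1000, 0x5000, 8, false) ? 0 : 1;
      }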

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is still waiting to be MMed. A low-end
    implementation is unlikely to have such a mechanism; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be), so there
    may not be a lot of AGEN capability or cache access port availability.
    So, the faster one makes MM (and by extension MS), the less one needs
    of overlap and pipelining.
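
    A sketch of one way to track "what has already been MMed", assuming,
    purely for illustration, an ascending copy order with a single progress
    cursor; a real implementation need not copy in that order, and all the
    names here are hypothetical.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Hypothetical in-flight MM state: bytes of the move below 'cursor'
         have already been copied; bytes at or above it have not. */
      struct mm_state {
          uint64_t src, dst, len;
          uint64_t cursor;        /* 0 .. len, advances as the MM proceeds */
      };

      /* Questions 4/6: an access to the destination sees copied data only
         if it lies entirely below the cursor. */
      static bool dst_access_in_completed_region(const struct mm_state *mm,
                                                 uint64_t addr, uint64_t alen)
      {
          return addr >= mm->dst && addr + alen <= mm->dst + mm->cursor;
      }

      /* Question 5: a store into the source is safe only if the MM has
         already read past it (again assuming ascending order). */
      static bool src_store_already_consumed(const struct mm_state *mm,
                                             uint64_t addr, uint64_t alen)
      {
          return addr >= mm->src && addr + alen <= mm->src + mm->cursor;
      }

      int main(void)
      {
          struct mm_state mm = { 0x1000, 0x9000, 0x3000, 0x1000 }; /* 1/3 done */
          printf("load of first third of dst ok:  %d\n",
                 dst_access_in_completed_region(&mm, 0x9000, 8));
          printf("store to first third of src ok: %d\n",
                 src_store_already_consumed(&mm, 0x1000, 8));
          return 0;
      }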


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Oct 16 16:48:39 2024
    From Newsgroup: comp.arch

    On 10/16/24 1:56 AM, Stephen Fuld wrote:
    Even though this is about the MM instruction, and the MM
    instruction is mentioned in other threads, they have lots of other
    stuff (thread drift), and this isn't related to C, standard or
    otherwise, so I thought it best to start a new thread,

    My questions are about what happens to subsequent instructions
    that immediately follow the MM in the stream when an MM
    instruction is executing.  Since an MM instruction may take quite
    a long time (in computer time) to complete I think it is useful to
    know what else can happen while the MM is executing.

    This would seem to be very implementation dependent.
    Architecturally, no following instructions can execute until after
    the MM completes. With respect to microarchitecture, an arbitrary
    amount of parallelism could be provided.


    I will phrase this as a series of questions.

    While Mitch Alsup can answer these more authoritatively, I will
    take a stab at them.

    1.    I assume that subsequent non-memory reference instructions
    can proceed simultaneously with the MM.  Is that correct?

    This would probably be true even for the in-order scalar
    implementation.
    2.    Can a load or store where the memory address is in neither
    the source nor the destination of the MM proceed simultaneously
    with the MM

    This is a little more complicated than just marking a register as
    not-ready (for a load destination), so might not be supported in
    a simple implementation. Memory accesses would have to check both
    ranges rather than just one of 32 register names or eight store
    buffer entries.

    Mitch Alsup's description of the small quasi-scalar core implies
    to me that the MM instruction would occupy the memory access
    interface until it is finished.

    I would guess that any out-of-order implementation would support
    loads and stores outside of the MM regions to proceed
    speculatively until the various OoO buffering structures are
    filled.

    3.    Can a load where the memory address is within the source of
    the MM proceed?

    My guess would be that any OoO implementation would support this.
    If the implementation checks for a hit in both ranges, it would
    seem to be little extra effort to allow a load to a 'clean'
    address to proceed.

    Supporting this and preventing reads of the destination and all
    stores would only require one address range check; loads can
    proceed as long as they are not within the destination.

    For the next questions, assume for exposition that the MM has
    proceeded to complete 1/3 of the move when the following
    instructions come up.

    4.    Can a load in the first third of the destination range proceed?

    I would guess that an out-of-order implementation would forward
    data from all stores performed speculatively by the MM (limited by
    the store queue). MM stores that are no longer speculative — where
    an interrupt would place the count — would seem to be naturally
    handled as if singular committed stores, i.e., following
    instructions could speculatively execute using those values.
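
    A rough sketch of that kind of forwarding: a younger load searches the
    store queue youngest-first and takes its data from the newest fully
    covering entry. Structure, sizes, and the refusal of partial overlaps
    are illustrative simplifications, not any real design.

      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      #define SQ_ENTRIES 16

      struct sq_entry { uint64_t addr; uint32_t len; uint8_t data[64]; bool valid; };

      struct store_queue {
          struct sq_entry e[SQ_ENTRIES];
          unsigned head;          /* index of youngest entry + 1 (mod size) */
      };

      /* Forward to a load of 'len' bytes at 'addr': search youngest-first and
         copy from the newest store that fully covers the load.  Partial
         overlap is simply refused here; a real design would stall or merge. */
      static bool sq_forward(const struct store_queue *sq,
                             uint64_t addr, uint32_t len, uint8_t *out)
      {
          for (unsigned i = 0; i < SQ_ENTRIES; i++) {
              unsigned idx = (sq->head + SQ_ENTRIES - 1 - i) % SQ_ENTRIES;
              const struct sq_entry *s = &sq->e[idx];
              if (!s->valid) continue;
              if (!(addr < s->addr + s->len && s->addr < addr + len))
                  continue;                         /* no overlap: keep looking */
              if (addr >= s->addr && addr + len <= s->addr + s->len) {
                  memcpy(out, s->data + (addr - s->addr), len);
                  return true;                      /* forwarded */
              }
              return false;                         /* partial overlap */
          }
          return false;              /* nothing younger: read the cache/memory */
      }

      int main(void)
      {
          struct store_queue sq = {0};
          sq.e[0] = (struct sq_entry){ .addr = 0x9000, .len = 64, .valid = true };
          sq.head = 1;                /* one speculative MM store in the queue */
          uint8_t buf[8];
          return sq_forward(&sq, 0x9008, 8, buf) ? 0 : 1;
      }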

    5.    Can a store in the first third of the source range proceed?

    In the non-speculative region of the MM, speculative stores could
    "execute", storing to the store queue. These stores would be
    squashed if the MM does not fully complete along with all other
    instructions after the MM. The MM is synchronous.

    A large MM that is no longer speculative might be implemented as
    avoiding the store queue to allow more stores after the MM to be
    speculated. For very large MMs, a copy engine farther from the
    core might be used.

    6.    Can a store in the first third of the destination range
    proceed?

    Since the MM has architecturally completed to roughly that point
    (some stores might only have "completed" to the store queue), it
    would not be difficult to support speculative stores in the
    completed range for an out-of-order implementation. These stores
    would be rolled back if the MM does not fully complete and commit.


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    For page-aligned copies, a copy-on-write mechanism might be used.

    There are also cache designs which support deduplication; cache
    block aligned copies might be faster than physical copying. With lossy/truncated cache compression, unaligned fragments might be
    deduplicated (and read-for-ownership might be avoided similar to
    having fine-grained valid bits).

    I rather suspect that what is physically possible is far broader
    than what is possible with a finite engineering budget.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 21:14:37 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    the first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, yes, I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    In practice, one is not going to allow MM to get farther than
    the miss buffer ahead of a mispredict shadow.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Oct 17 08:49:08 2024
    From Newsgroup: comp.arch

    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing.  Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1.    I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM.  Is that correct?

    Yes, they may begin but they cannot retire.

    2.    Can a load or store where the memory address is in neither the
    source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3.    Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4.    Can a load in the first third of the destination range proceed?

    5.    Can a store in the first third of the source range proceed?

    6.    Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is still waiting to be MMed. A low-end
    implementation is unlikely to have such a mechanism; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far fewer resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes into just L3, or even a dedicated very small buffer or two.
    Furthermore, for the transfers after the first, unless there is a page
    crossing, why go through a full AGEN, when a simple add to the previous
    address is all that is required, thus freeing AGEN resources.
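
    The split being suggested, sketched in plain C: peel off the ragged head
    and tail and handle the cache-line-aligned middle as a separate bulk
    path (which, in hardware, could bypass the L1 and go to L3 or a small
    dedicated buffer; here it is just another memcpy). The line size and
    names are illustrative.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define LINE 64u                /* illustrative cache-line size */

      /* Copy n bytes, separating the line-aligned middle (which a hardware MM
         might stream around the L1) from the ragged head and tail.  Alignment
         is taken relative to the destination only. */
      static void mm_split_copy(uint8_t *dst, const uint8_t *src, size_t n)
      {
          size_t head = (LINE - ((uintptr_t)dst & (LINE - 1))) & (LINE - 1);
          if (head > n) head = n;
          memcpy(dst, src, head);                         /* ragged head  */

          size_t remaining = n - head;
          size_t middle = remaining & ~(size_t)(LINE - 1);
          memcpy(dst + head, src + head, middle);         /* bulk path    */

          memcpy(dst + head + middle, src + head + middle,
                 remaining - middle);                     /* ragged tail  */
      }

      int main(void)
      {
          uint8_t a[300], b[300];
          for (int i = 0; i < 300; i++) a[i] = (uint8_t)i;
          mm_split_copy(b + 3, a + 5, 290);
          return memcmp(b + 3, a + 5, 290);               /* 0 on success */
      }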


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly, for
    the time it takes for a single load to take place, in order to allow
    the non-memory-reference instructions following that load to execute
    overlapped with the completion of the MM. Needs trace analysis.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 17 13:16:24 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the
    source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3. Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is still waiting to be MMed. A low-end
    implementation is unlikely to have such a mechanism; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far fewer resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes into just L3, or even a dedicated very small buffer or two.
    Furthermore, for the transfers after the first, unless there is a page
    crossing, why go through a full AGEN, when a simple add to the previous
    address is all that is required, thus freeing AGEN resources.


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly, for
    the time it takes for a single load to take place, in order to allow
    the non-memory-reference instructions following that load to execute
    overlapped with the completion of the MM. Needs trace analysis.

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after
    it should be no different, except that the exact order of the individual
    MM bytes moved is not defined beyond being overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. under TSO, MM being a sequence of stores cannot start until it is at
    the end of the Load Store Queue and ready to retire. So all older LD and
    ST must have retired, and we only need to consider interactions of MM with
    younger LD and ST.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then the LSQ must retain the LD [x] someplace so it can
    check whether, at some later time, perhaps far later, a MM store to the
    physical address of &dst overlaps the physical address of [x].
    If it does overlap then the LSQ must trigger a replay of the younger LD [x]
    so it picks up the new value from the dst buffer.

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.
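
    A sketch of that bookkeeping: bypassed younger loads are remembered,
    and each MM store that later drains checks them for overlap and marks
    any hit for replay. Structure sizes and names are illustrative only.

      #include <stdbool.h>
      #include <stdint.h>

      #define MAX_BYPASSED 32

      /* A younger load that was allowed to execute ahead of an older MM. */
      struct bypassed_load {
          uint64_t paddr;
          uint32_t len;
          bool     valid;
          bool     needs_replay;
      };

      static struct bypassed_load lsq[MAX_BYPASSED];

      /* Remember a load that bypassed the MM (called when it executes early). */
      static void lsq_record_bypass(unsigned slot, uint64_t paddr, uint32_t len)
      {
          lsq[slot] = (struct bypassed_load){ paddr, len, true, false };
      }

      /* Called for each MM store as it actually writes [paddr, paddr+len):
         any remembered load that overlaps must be replayed so it picks up
         the new destination data. */
      static void lsq_check_mm_store(uint64_t paddr, uint32_t len)
      {
          for (unsigned i = 0; i < MAX_BYPASSED; i++) {
              if (!lsq[i].valid) continue;
              if (paddr < lsq[i].paddr + lsq[i].len &&
                  lsq[i].paddr < paddr + len)
                  lsq[i].needs_replay = true;
          }
      }

      int main(void)
      {
          lsq_record_bypass(0, 0x9000, 8);     /* LD [x] executed early       */
          lsq_check_mm_store(0x9000, 64);      /* MM later stores over it     */
          return lsq[0].needs_replay ? 0 : 1;  /* the load must be replayed   */
      }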


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 17 21:49:19 2024
    From Newsgroup: comp.arch

    On Thu, 17 Oct 2024 17:16:24 +0000, EricP wrote:

    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after
    it should be no different, other than the exact order of individual MM
    bytes moved is not defined other than it is overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. Under TSO, MM being a sequence of stores cannot start until it is
    at the end of the Load Store Queue and ready to retire. So all older LD
    and
    ST must have retired and we only need consider interactions of MM with younger LD and ST.

    My 66000 is NOT TSO; it is causal unless special areas are being
    accessed.

    But, yes, MM must operate as if it is ordered with other memory
    reference
    instructions.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then LSQ must retain the LD [x] someplace so it can
    check if at some later time, perhaps far later, that a MM store to the physical address of &dst overlaps the physical address of [x].
    If it does overlap then LSQ must trigger a replay of the younger LD [x]
    so it picks up the new value from dst buffer.

    No essential disagreement.

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.

    Once again it is not TSO.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sun Oct 20 11:57:59 2024
    From Newsgroup: comp.arch

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,
    possibly crossing page boundaries if associated with a prefetcher
    that crosses page boundaries and so does address translation).
    A stride based prefetcher would have the address generation
    capability to process the streaming access of MM.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    In practice, one is not going to allow MM to get farther than
    the miss buffer ahead of a mispredict shadow.

    Once the MM itself is non-speculative (i.e., branches/exceptions
    in the path to it have all resolved), it seems an MM could
    progress as far as permissions have been confirmed.

    With its synchronous interface, it seems that for sufficiently
    large MM operations one might want to context switch the MM off of
    a high performance core and onto a MM engine so that the core
    could be used for other work.

    Given the cache misses (including predictors) from a context
    switch, this might not be worthwhile even if the MM engine was
    substantially more energy efficient.

    Forcing a thread to pause while a "large" MM operation is done
    _feels_ wrong.

    I guess one could architect a fork+terminate operation that
    would allow a non-speculative but physically incomplete copy to
    commit and then "end" that thread while continuing in a "new"
    thread from the instruction after the fork. Such an interface
    seems clunky but might be more general than just having a
    quasi-synchronous MM with similar behavior.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 21 00:25:57 2024
    From Newsgroup: comp.arch

    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be trapped or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1, you end up not having a TLB to
    translate page-crossing addresses.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Oct 20 20:30:45 2024
    From Newsgroup: comp.arch

    On 10/20/2024 5:25 PM, MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be detected or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1; you end up not having a TLB to
    translate page crossing addresses.

    Yes, but once you translate the starting address of a page, which you
    can tell from the low-order X bits (X depends on page size) being
    zero, you don't need to translate again until the next page crossing.
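
    A sketch of that incremental address generation, with a made-up
    translate() standing in for the TLB: translation is repeated only when
    the low-order bits wrap to zero, i.e., at a page crossing.

      #include <stdint.h>
      #include <stdio.h>

      #define PAGE_SHIFT 12u                  /* 4 KiB pages for illustration */
      #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

      /* Made-up stand-in for a TLB lookup: virtual page -> physical page. */
      static uint64_t translate(uint64_t vaddr)
      {
          return (vaddr & ~(uint64_t)PAGE_MASK) + 0x100000000ull;
      }

      int main(void)
      {
          uint64_t va = 0x7ffe000ff8ull;      /* 8 bytes before a page end */
          uint64_t pa_page = translate(va);   /* one real translation      */
          int translations = 1;

          for (int i = 0; i < 4; i++) {       /* walk four 8-byte chunks   */
              uint64_t pa = pa_page + (va & PAGE_MASK);
              printf("va=%#llx -> pa=%#llx\n",
                     (unsigned long long)va, (unsigned long long)pa);
              va += 8;
              if ((va & PAGE_MASK) == 0) {    /* low-order bits hit zero   */
                  pa_page = translate(va);    /* only now translate again  */
                  translations++;
              }
          }
          printf("translations performed: %d\n", translations);
          return 0;
      }
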
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 21 06:32:52 2024
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not only has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory
    controller may be less capable, and leave scrubbing to software; maybe
    also refresh?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 21 12:56:59 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    A typical DDR5 row contains 4096 data bits. That's 32x bigger than the
    internal ECC block. In order to do a scrub at refresh without a major
    increase in refresh timing, one would need a lot more ECC correction
    logic than is currently present.
    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).
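
    The arithmetic, spelled out (assuming the DDR5 on-die ECC block covers
    128 data bits, which is what makes the 32x figure come out):

      #include <stdio.h>

      int main(void)
      {
          int row_bits       = 4096;  /* data bits in a typical DDR5 row */
          int ecc_block_bits = 128;   /* data bits per on-die ECC block (128+8 SEC) */

          /* Scrubbing a whole row during refresh would mean checking (and
             possibly correcting) this many ECC blocks within the refresh
             window, or adding that many correctors in parallel. */
          printf("ECC blocks per row: %d\n", row_bits / ecc_block_bits);  /* 32 */
          return 0;
      }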

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory controller may be less capable, and leave scrubbing to software; maybe
    also refresh?


    Hopefully the last part is tongue in cheek.

    - anton


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 22 13:08:38 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.
    But that particular error, no, it should not be required to be detected,
    as that would force an extra read cycle onto every quadword store.

    By logging I mean a FIFO buffer that error reports can be dumped into.
    This ensures that the fact that an error was detected is not lost.
    The buffer can be set to trigger a high-priority interrupt on a number of
    conditions, such as the FIFO being 3/4 full, specific kinds of errors,
    or a log entry having sat in the buffer for a while, say 1 sec.
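
    A minimal sketch of such a log FIFO; the field names, sizes, and the
    3/4-full threshold are just the examples from the text, and the
    high-priority interrupt is modeled as a flag.

      #include <stdbool.h>
      #include <stdint.h>

      #define LOG_ENTRIES 64

      struct err_report {
          uint64_t paddr;        /* failing address, to narrow down the FRU */
          uint8_t  severity;     /* e.g. 0 = corrected, 2 = uncorrectable   */
          uint64_t timestamp;
      };

      struct err_log {
          struct err_report e[LOG_ENTRIES];
          unsigned head, tail;   /* ring-buffer indices */
          bool     irq_pending;  /* "high-priority interrupt" modeled as a flag */
      };

      static unsigned log_count(const struct err_log *l)
      {
          return (l->head + LOG_ENTRIES - l->tail) % LOG_ENTRIES;
      }

      /* Dump a report into the FIFO; request the interrupt when 3/4 full or
         when the report is severe enough to warrant immediate attention. */
      static void log_error(struct err_log *l, struct err_report r)
      {
          if (log_count(l) == LOG_ENTRIES - 1)
              return;                         /* full: drop (or overwrite oldest) */
          l->e[l->head] = r;
          l->head = (l->head + 1) % LOG_ENTRIES;
          if (log_count(l) >= (LOG_ENTRIES * 3) / 4 || r.severity > 1)
              l->irq_pending = true;
      }

      int main(void)
      {
          struct err_log log = {0};
          log_error(&log, (struct err_report){ .paddr = 0x12345000, .severity = 2 });
          return log.irq_pending ? 0 : 1;     /* severe error raises the interrupt */
      }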

    Also, that error should not raise an error exception in an application,
    for any number of reasons: most writes are lazy and could complete
    long after the app that did the store has been switched out.

    But each error must be assessed on an individual basis.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 22 18:13:25 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether
    logging is useful or required.

    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log an error?

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/ram location information
    to at least narrow down to an FRU.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 22 18:16:25 2024
    From Newsgroup: comp.arch

    On Tue, 22 Oct 2024 17:08:38 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be
    logged. But that particular error, no it should not be required to
    be detected as that would force an extra read cycle onto every
    quadword store.

    This, too, is my interpretation--in a similar way that one can use
    bad-ECC to denote uninitialized data which "goes away" when the
    data gets written, Memory Moving good-data over bad-data makes
    the error "go away".

    By logging I mean a fifo buffer that error reports can be dumped into.

    As I specified, all data errors are available at execute time
    and can be properly trapped with precision.

    This ensures that the fact an error was detected is not lost.
    This can be set to trigger a high priority interrupt on a number of conditions such as fifo buffer 3/4 full, or specific kinds of errors,
    or when a log has been in the buffer a while, say 1 sec.

    Also that error should not raise an error exception in an application
    for any number of reasons, such as most writes are lazy and could be
    long after the app that did the store has been switched out.

    In My 66000 all written data is available prior to retirement, and
    can be detected/corrected before the store is retired.

    But each error must be assessed on an individual basis.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 22 18:21:04 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 9:56:59 +0000, Michael S wrote:

    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    Typical DDR5 row contains 4096 data bits. That's 32x bigger than
    internal ECC block. In order to do scrub at refresh without major
    increase in refresh timing one would need a lot more ECC correction
    logic than currently present.

    The standard 64+8 code is SECDED and takes 8 gate delays (5-XOR gates,
    2 NAND gates, 1 XOR gate).
    One can do as many 64+8 error checks as one wants within those gate
    delays.

    It is only when one wants better than SEC/DED that things get
    interesting.
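
    For reference, a self-contained sketch of a (72,64) extended-Hamming
    SECDED code: check bits at power-of-two positions plus an overall
    parity bit, enough to correct any single-bit flip and to detect (but
    not correct) any double-bit flip. This is the textbook construction,
    not any particular controller's 64+8 code.

      #include <stdint.h>
      #include <stdio.h>

      /* (72,64) extended-Hamming SECDED sketch.  Codeword bit positions 0..71:
         position 0 is the overall parity bit, power-of-two positions
         (1,2,4,8,16,32,64) are the Hamming check bits, and the remaining 64
         positions carry the data. */

      static void encode(uint64_t data, uint8_t cw[72])
      {
          int d = 0;
          for (int pos = 1; pos < 72; pos++)
              cw[pos] = ((pos & (pos - 1)) == 0) ? 0
                                                 : (uint8_t)((data >> d++) & 1);

          for (int p = 1; p < 72; p <<= 1) {           /* Hamming check bits */
              uint8_t parity = 0;
              for (int pos = 1; pos < 72; pos++)
                  if ((pos & p) && pos != p)
                      parity ^= cw[pos];
              cw[p] = parity;
          }
          cw[0] = 0;                                   /* overall even parity */
          for (int pos = 1; pos < 72; pos++)
              cw[0] ^= cw[pos];
      }

      /* Returns 0 = clean, 1 = single-bit error corrected, 2 = uncorrectable. */
      static int decode(uint8_t cw[72])
      {
          int syndrome = 0;
          uint8_t overall = 0;
          for (int pos = 0; pos < 72; pos++) {
              overall ^= cw[pos];
              if (pos >= 1 && cw[pos])
                  syndrome ^= pos;
          }
          if (syndrome == 0 && overall == 0) return 0; /* no error            */
          if (overall == 1) {                          /* odd number of flips */
              if (syndrome != 0) cw[syndrome] ^= 1;    /* data or check bit   */
              else               cw[0] ^= 1;           /* overall parity bit  */
              return 1;                                /* corrected (SEC)     */
          }
          return 2;                /* even flips, syndrome != 0: DED, trap it */
      }

      int main(void)
      {
          uint8_t cw[72];
          encode(0x0123456789abcdefULL, cw);

          cw[13] ^= 1;                                 /* single-bit upset */
          printf("single flip -> %d (1 = corrected)\n", decode(cw));

          cw[13] ^= 1; cw[40] ^= 1;                    /* double-bit upset */
          printf("double flip -> %d (2 = uncorrectable)\n", decode(cw));
          return 0;
      }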

    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).

    DRAM clock should not be that much slower than CPU clock -- a factor
    of 2-3 is expected (logic delay), putting DRAM clock in the 0.75ns
    range.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 22 14:45:09 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.
    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???
    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether logging is useful or required.

    The scenario given was a store that overwrites the erroneous data completely.
    Assuming an 8+64 SECDED ECC, no, it should not be required to be detected,
    and therefore not logged or raised, as that would force an unnecessary
    read cycle on every quadword memory store just to check for something
    that very rarely occurs.

    Compare that to, say, a byte memory store, which is an RMW cycle.
    If the read portion detects an error then it should be logged,
    but again not raised as an error, because stores are mostly asynchronous
    and could complete long after the store instruction.


    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log and error?

    I'm drawing a distinction between logging, which is asynchronous,
    and raising some kind of exception, which requires the detection
    be synchronous with the code.

    But even if an error is synchronous, if it is transient *and corrected*,
    whether a memory or bus error, or even an internal register parity error
    (if it had such checks), I want it logged but not elevated to an exception.

    A speculative read is the same case as a scrubber read detecting an error -
    log it, but no exception, because it is asynchronous to the code.

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/ram location information
    to at least narrow down to an FRU.

    Yes, then software might take the physical addresses from the disk log
    and backtrack to the physical page frames affected, and if not transient
    then mark those pages to be retired the next time they get recycled.

    Oh, and a pop-up "You have memory errors"



    --- Synchronet 3.20a-Linux NewsLink 1.114