Even though this is about the MM instruction, and the MM instruction is
mentioned in other threads, they have lots of other stuff (thread
drift), and this isn't related to C, standard or otherwise, so I thought
it best to start a new thread.
My questions are about what happens to subsequent instructions that
immediately follow the MM in the stream when an MM instruction is
executing. Since an MM instruction may take quite a long time (in
computer time) to complete I think it is useful to know what else can
happen while the MM is executing.
I will phrase this as a series of questions.
1. I assume that subsequent non-memory reference instructions can
proceed simultaneously with the MM. Is that correct?
2. Can a load or store where the memory address is in neither the source
nor the destination of the MM proceed simultaneously with the MM?
3. Can a load where the memory address is within the source of the MM
proceed?
For the next questions, assume for exposition that the MM has proceeded
to complete 1/3 of the move when the following instructions come up.
4. Can a load in the first third of the destination range proceed?
5. Can a store in the first third of the source range proceed?
6. Can a store in the first third of the destination range proceed?
Here is a question that I will leave to Mitch:
Can a MM that has confirmed permissions commit before it has been
performed such that uncorrectable errors would be recognized not
on read of the source but on later read of the destination?
I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a quasi-
synchronous copy that allows the processor to continue doing work
outside of the MM.
If a translation map is provided for coherence, any MM could
commit once it is not speculative but before the actual copy has
been performed. Tracking what parts have been completed in the
presence of other stores would have significant overhead.
On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:
Even though this is about the MM instruction, and the MM instruction is
mentioned in other threads, they have lots of other stuff (thread
drift), and this isn't related to C, standard or otherwise, so I thought
it best to start a new thread,
My questions are about what happens to subsequent instructions that
immediately follow the MM in the stream when an MM instruction is
executing. Since an MM instruction may take quite a long time (in
computer time) to complete I think it is useful to know what else can
happen while the MM is executing.
I will phrase this as a series of questions.
1. I assume that subsequent non-memory reference instructions can
proceed simultaneously with the MM. Is that correct?
Yes, they may begin but they cannot retire.
2. Can a load or store where the memory address is in neither the source
nor the destination of the MM proceed simultaneously with the MM?
Yes, in higher end implementations--after checking for no-conflict
{and this is dependent on accessing DRAM not MMI/O or config spaces}
3. Can a load where the memory address is within the source of the MM
proceed?
It is just read data, so, yes--at least theoretically.
For the next questions, assume for exposition that the MM has proceeded
to complete 1/3 of the move when the following instructions come up.
4. Can a load in the first third of the destination range proceed?
5. Can a store in the first third of the source range proceed?
6. Can a store in the first third of the destination range proceed?
In all 3 of these cases, one must have a good way to determine what has
already been MMed and what is waiting to be MMed. A low end implementation
is unlikely to have such; a high end one will have such.
On the other hand, MM is basically going to saturate the cache ports
(if for no other reason than being as fast as it can be) so, there
may not be a lot of AGEN capability or cache access port availability.
So, the faster one makes MM (and by extension MS) the less one needs
of overlap and pipelining.
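To make questions 3 through 6 concrete, here is a minimal sketch of the kind of "what has already been MMed" check described above. It is only an illustration under loud assumptions that are not from the thread: the MM copies in ascending address order, a single `done` counter records progress, source and destination do not overlap each other, and the names `MMState` and `may_proceed` are hypothetical. It covers only address conflicts; memory-ordering rules may restrict younger accesses further.

```python
from dataclasses import dataclass

@dataclass
class MMState:
    src: int      # source start address
    dst: int      # destination start address
    length: int   # total bytes to move
    done: int     # bytes already moved (assumed to advance in ascending order)

def overlaps(a, a_len, b, b_len):
    """True if byte ranges [a, a+a_len) and [b, b+b_len) intersect."""
    return a_len > 0 and b_len > 0 and a < b + b_len and b < a + a_len

def may_proceed(mm, kind, addr, size):
    """Illustrative policy: may a younger access run before the MM finishes?"""
    pending = mm.length - mm.done
    hits_src_pending = overlaps(addr, size, mm.src + mm.done, pending)
    hits_dst_pending = overlaps(addr, size, mm.dst + mm.done, pending)
    if kind == "load":
        # The MM only reads the source, so loads from the source are safe.
        # A load from the destination is safe only where the MM has written.
        return not hits_dst_pending
    if kind == "store":
        # A store must not change source bytes the MM has yet to read,
        # and must not land where the MM would later overwrite it.
        return not (hits_src_pending or hits_dst_pending)
    raise ValueError(kind)

# MM of 0x300 bytes that is one third complete.
mm = MMState(src=0x1000, dst=0x2000, length=0x300, done=0x100)
q4 = may_proceed(mm, "load", 0x2010, 8)    # load in first third of destination
q5 = may_proceed(mm, "store", 0x1010, 8)   # store in first third of source
q6 = may_proceed(mm, "store", 0x2010, 8)   # store in first third of destination
```

Under this policy all three first-third accesses may go ahead, while the same accesses aimed at the not-yet-moved two thirds would have to wait. A low end implementation without the `done` tracking would simply block on any overlap.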
On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
On the other hand, MM is basically going to saturate the cache ports
(if for no other reason than being as fast as it can be) so, there
may not be a lot of AGEN capability or cache access port availability.
Yes, but. For a large transfer, say many hundreds to thousands of
bytes, why run the "middle" bytes through the cache, especially the L1
(as you indicated in reply to Paul)? It would take some analysis of
traces to know for sure, but I would expect the probability of reuse of
such bytes to be low. If that is true, it would take far fewer resources
(and avoid "sweeping" the cache) to do at least the intermediate reads
and writes into just L3, or even a dedicated very small buffer or two.
Furthermore, for the transfers after the first, unless there is a page
crossing, why go through a full AGEN, when a simple add to the previous
address is all that is required, thus freeing AGEN resources?
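The "simple add unless there is a page crossing" idea can be sketched as splitting the transfer at page boundaries: inside each piece the next address is just the previous one plus the transfer size, and a fresh translation is only needed at the start of a piece. This is a rough illustration, assuming a 4096-byte page; `page_chunks` is a hypothetical helper, not anything from the My66000 definition.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def page_chunks(vaddr, length, page_size=PAGE_SIZE):
    """Split [vaddr, vaddr+length) into pieces that never cross a page boundary."""
    chunks = []
    while length > 0:
        room = page_size - (vaddr % page_size)  # bytes left in the current page
        n = min(room, length)
        chunks.append((vaddr, n))               # one translation covers this piece
        vaddr += n
        length -= n
    return chunks

# A 0x300-byte transfer starting 0x100 bytes before a page boundary.
pieces = page_chunks(0x1F00, 0x300)
```

Here `pieces` has two entries, so only two full address generations with translation would be needed; every other step within a piece is an increment.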
So, the faster one makes MM (and by extension MS) the less one needs
of overlap and pipelining.
Certainly true for small transfers, but for larger ones, I am not so
sure. It may make more sense to delay the MM completion slightly for
the time it takes for a single load to take place in order to allow the
non-memory reference instructions following that load to execute
overlapped with the completion of the MM. Needs trace analysis.
Stephen Fuld wrote:
On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
MM is a (long) series of byte LD and ST to virtual addresses.
The ordering rules for MM relative to scalar LD and ST before and after
it should be no different, other than the exact order of individual MM
bytes moved is not defined other than it is overlap safe.
But the same bypassing and forwarding rules apply.
E.g. Under TSO, MM being a sequence of stores cannot start until it is
at the end of the Load Store Queue and ready to retire. So all older LD
and ST must have retired and we only need consider interactions of MM
with younger LD and ST.
If an implementation allows a younger LD [x] to bypass an older
MM &dst, &src, then LSQ must retain the LD [x] someplace so it can
check whether at some later time, perhaps far later, a MM store to the
physical address of &dst overlaps the physical address of [x].
If it does overlap then LSQ must trigger a replay of the younger LD [x]
so it picks up the new value from dst buffer.
A younger store ST to any address cannot be seen to bypass an older MM
store to any address, though it could prefetch.
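The bypass check for a younger load described above can be sketched as follows. This is a simplified model, not a description of any real LSQ: `BypassedLoads`, `record` and `on_mm_store` are hypothetical names, and physical addresses are treated as plain integers.

```python
def ranges_overlap(a, a_len, b, b_len):
    """True if byte ranges [a, a+a_len) and [b, b+b_len) intersect."""
    return a_len > 0 and b_len > 0 and a < b + b_len and b < a + a_len

class BypassedLoads:
    """Hypothetical record of younger loads that executed ahead of an older MM."""

    def __init__(self):
        self.loads = []  # entries are (tag, physical address, size)

    def record(self, tag, paddr, size):
        """Remember a younger load that was allowed to bypass the MM."""
        self.loads.append((tag, paddr, size))

    def on_mm_store(self, paddr, size):
        """Called for each chunk the MM writes; returns tags of loads to replay."""
        replay = [t for (t, a, n) in self.loads
                  if ranges_overlap(a, n, paddr, size)]
        # A replayed load re-executes, so it no longer needs tracking here.
        self.loads = [e for e in self.loads if e[0] not in replay]
        return replay

lsq = BypassedLoads()
lsq.record("ld1", 0x2040, 8)   # load that falls inside the MM destination
lsq.record("ld2", 0x9000, 8)   # unrelated load
first = lsq.on_mm_store(0x2000, 64)   # MM writes the line before ld1's address
second = lsq.on_mm_store(0x2040, 64)  # MM writes the line containing ld1
```

The point of the sketch is the retention cost: every bypassing load has to stay in some structure until the MM has finished writing, which may be far later.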
On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:
Here is a question that I will leave to Mitch:
Can a MM that has confirmed permissions commit before it has been
performed such that uncorrectable errors would be recognized not
on read of the source but on later read of the destination?
A Memory Move does not <necessarily> read the destination. In
order to make the data transfers occur in cache line sizes,
the first and the last line may be read, but the intermediate
ones are not read (from DRAM) only to be re-written. An
implementation with byte write enables might not read any
of the destination lines.
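As a rough illustration of which destination lines would have to be read, the sketch below lists the lines that are only partly overwritten by a move. It assumes a 64-byte cache line, which is not stated in the thread, and `dest_lines_needing_read` is a hypothetical helper name.

```python
LINE = 64  # assumed cache line size in bytes

def dest_lines_needing_read(dst, length, line=LINE, byte_enables=False):
    """Return base addresses of destination lines that are only partly overwritten."""
    if length <= 0 or byte_enables:
        # With byte write enables, partial lines can be written without a read.
        return []
    end = dst + length
    first = dst - dst % line
    last = (end - 1) - (end - 1) % line
    partial = []
    for base in sorted({first, last}):
        fully_covered = dst <= base and base + line <= end
        if not fully_covered:
            partial.append(base)
    return partial

unaligned = dest_lines_needing_read(0x1010, 200)   # starts and ends mid-line
aligned = dest_lines_needing_read(0x1000, 128)     # two whole lines
```

For the unaligned move only the first and last lines show up; every intermediate line is fully overwritten and need not be fetched.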
Then there is the issue with uncorrectable errors at the
receiving cache. The current protocol has the sender (core)
not release his write buffer until LLC has replied that
the data arrived without ECC trouble. Thus, the instruction
causing the latent uncorrectable error is not retired until
the data has arrived successfully at LLC.
I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a quasi-
synchronous copy that allows the processor to continue doing work
outside of the MM.
As I mentioned before, Yes I intend to allow other instructions
to operate concurrently with MM, but I also expect MM to consume
all of L1 cache bandwidth. Just like LD L1-L2-miss operates
concurrently with FDIV.
If a translation map is provided for coherence, any MM could
commit once it is not speculative but before the actual copy has
been performed. Tracking what parts have been completed in the
presence of other stores would have significant overhead.
In practice, one is not going to allow MM to get farther than
the miss buffer ahead of a mispredict shadow.
On 10/16/24 5:14 PM, MitchAlsup1 wrote:
On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:
Here is a question that I will leave to Mitch:
Can a MM that has confirmed permissions commit before it has been
performed such that uncorrectable errors would be recognized not
on read of the source but on later read of the destination?
A Memory Move does not <necessarily> read the destination. In
order to make the data transfers occur in cache line sizes,
the first and the last line may be read, but the intermediate
ones are not read (from DRAM) only to be re-written. An
implementation with byte write enables might not read any
of the destination lines.
I was referring to a following instruction reading the
destination.
Then there is the issue with uncorrectable errors at the
receiving cache. The current protocol has the sender (core)
not release his write buffer until LLC has replied that
the data arrived without ECC trouble. Thus, the instruction
causing the latent uncorrectable error is not retired until
the data has arrived successfully at LLC.
I was thinking primarily about uncorrectable errors in the
source. It would be convenient to software for MM to fail early on
an uncorrectable source error. It would be a little less
convenient (and possibly more complex in hardware) to generate an
ECC exception at the end of the MM instruction (or when it
pauses from a context switch).
With source-signaled errors, MM might be used to scrub memory
(assuming the microarchitecture did not optimize out copies
onto self as nops☺).
Not signaling the error until the destination is read some time
later prevents software from assuming the copy was correct at the
time of copying, but allows a copy to commit once all permissions
have been verified.
I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a quasi-
synchronous copy that allows the processor to continue doing work
outside of the MM.
As I mentioned before, Yes I intend to allow other instructions
to operate concurrently with MM, but I also expect MM to consume
all of L1 cache bandwidth. Just like LD L1-L2-miss operates
concurrently with FDIV.
For large copies, I could see having the copying done at L2 or
even L3 with distinct address generation (at least within a page,
On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:
I was thinking primarily about uncorrectable errors in the
source. It would be convenient to software for MM to fail early on
an uncorrectable source error. It would be a little less
convenient (and possibly more complex in hardware) to generate an
ECC exception at the end of the MM instruction (or when it
pauses from a context switch).
As I describe above, all errors are detected at source read,
precisely so that they can be detected or recovered.
With source-signaled errors, MM might be used to scrub memory
(assuming the microarchitecture did not optimize out copies
onto self as nops☺).
Not signaling the error until the destination is read some time
later prevents software from assuming the copy was correct at the
time of copying, but allows a copy to commit once all permissions
have been verified.
The real question, here, is: if you have corrupted data in memory
and overwrite it, in its entirety, do you have to detect the error ??
should you detect the error, or should overwriting the error make
it "go away" ???
I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a quasi-
synchronous copy that allows the processor to continue doing work
outside of the MM.
As I mentioned before, Yes I intend to allow other instructions
to operate concurrently with MM, but I also expect MM to consume
all of L1 cache bandwidth. Just like LD L1-L2-miss operates
concurrently with FDIV.
For large copies, I could see having the copying done at L2 or
even L3 with distinct address generation (at least within a page,
As you get farther out than L1; you end up not having a TLB to
translate page crossing addresses.
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
I was thinking primarily about uncorrectable errors in the
source. It would be convenient to software for MM to fail early on
an uncorrectable source error. It would be a little less
convenient (and possibly more complex in hardware) to generate an
ECC exception at the end of the MM instruction (or when it
pauses from a context switch).
Welcome to the DDR5 age (which started in 2021). DDR5 not just has
on-die ECC, it also has ECS (error check and scrub). From
<https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:
|An additional feature of the DDR5 SDRAM ECC is the error check and
|scrub (ECS) function. The ECS function is a read of internal data and
|the writing back of corrected data if an error occurred. ECS can be
|used as a manual function initiated by a Multi-Purpose Command (MPC),
|or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
|schedules and performs the ECS commands as needed to complete a full
|scrub of the data bits in the array within the recommended 24-hour
|period. At the completion of a full-array scrub, the DDR5 reports the
|number of errors that were corrected during the scrub (once the error
|count exceeds a minimum fail threshold) and reports the row with the
|highest number of errors, which is also subject to a minimum
|threshold.
I am wondering why scrubbing is not performed automatically on
refresh.
Even before DDR5, scrubbing is a feature of the memory controller that
it performs in hardware (i.e., without software having to do something
in an interrupt handler or some such). Of course the My66000 memory
controller may be less capable, and leave scrubbing to software; maybe
also refresh?
- anton
MitchAlsup1 wrote:
On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:
With source-signaled errors, MM might be used to scrub memory
(assuming the microarchitecture did not optimize out copies
onto self as nops☺).
Not signaling the error until the destination is read some time
later prevents software from assuming the copy was correct at the
time of copying, but allows a copy to commit once all permissions
have been verified.
The real question, here, is: if you have corrupted data in memory
and overwrite it, in its entirety, do you have to detect the error ??
should you detect the error, or should overwriting the error make
it "go away" ???
IF an error is detected then it should be required to at least be
logged. But that particular error, no it should not be required to
be detected as that would force an extra read cycle onto every
quadword store.
By logging I mean a fifo buffer that error reports can be dumped into.
This ensures that the fact an error was detected is not lost.
This can be set to trigger a high priority interrupt on a number of
conditions such as fifo buffer 3/4 full, or specific kinds of errors,
or when a log has been in the buffer a while, say 1 sec.
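The three trigger conditions above can be sketched as a small software model of such a log. Everything concrete here is an assumption for illustration: the `ErrorLog` class, the capacity of 16 entries, the choice of "uncorrectable" as the urgent kind, and the report fields.

```python
from collections import deque

class ErrorLog:
    """Hypothetical fifo of error reports with interrupt trigger conditions."""

    def __init__(self, capacity=16, urgent_kinds=("uncorrectable",), max_age=1.0):
        self.capacity = capacity
        self.urgent_kinds = set(urgent_kinds)
        self.max_age = max_age          # seconds a report may wait, "say 1 sec"
        self.fifo = deque()

    def report(self, kind, location, now):
        """Append a report; returns False instead of dropping silently when full."""
        if len(self.fifo) >= self.capacity:
            return False
        self.fifo.append((now, kind, location))
        return True

    def should_interrupt(self, now):
        """True if any of the three trigger conditions holds."""
        if not self.fifo:
            return False
        if 4 * len(self.fifo) >= 3 * self.capacity:        # buffer 3/4 full
            return True
        if any(kind in self.urgent_kinds for _, kind, _ in self.fifo):
            return True                                    # specific kind of error
        return now - self.fifo[0][0] >= self.max_age       # oldest report is stale

log = ErrorLog(capacity=4)
log.report("corrected", "bank0", now=0.0)
early = log.should_interrupt(now=0.5)   # one young corrected error
late = log.should_interrupt(now=1.0)    # same report, now 1 second old
```

Keeping the interrupt decision separate from `report` matches the idea that logging must never be lost, while signalling software can be deferred.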
Also that error should not raise an error exception in an application
for any number of reasons, such as most writes are lazy and could be
long after the app that did the store has been switched out.
But each error must be assessed on an individual basis.
On Mon, 21 Oct 2024 06:32:52 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I am wondering why scrubbing is not performed automatically on
refresh.
A typical DDR5 row contains 4096 data bits. That's 32x bigger than the
internal ECC block. In order to do scrub at refresh without a major
increase in refresh timing one would need a lot more ECC correction
logic than is currently present.
Even with a lot of extra logic there will be some slowdown, likely one
clock (== 16T == 2.5 to 3.3 ns).
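The numbers above can be checked with a little arithmetic. Two things here are assumptions rather than statements from the post: that the internal ECC block covers 128 data bits (which is what 4096 / 32 implies), and that "16T" means 16 data transfers at rates between 4800 and 6400 MT/s.

```python
ROW_DATA_BITS = 4096    # from the post
ECC_BLOCK_BITS = 128    # assumed on-die ECC data block size (4096 / 32)

blocks_per_row = ROW_DATA_BITS // ECC_BLOCK_BITS

def burst_time_ns(transfers, mega_transfers_per_s):
    """Time for a number of transfers at a given data rate, in nanoseconds."""
    return transfers / mega_transfers_per_s * 1000.0

fast = burst_time_ns(16, 6400)   # assumed upper data rate
slow = burst_time_ns(16, 4800)   # assumed lower data rate
```

With those assumptions `blocks_per_row` is 32, and the 16-transfer time spans roughly 2.5 ns to 3.3 ns, which agrees with the range quoted.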
EricP <ThatWouldBeTelling@thevillage.com> writes:
MitchAlsup1 wrote:
On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:
With source-signaled errors, MM might be used to scrub memory
(assuming the microarchitecture did not optimize out copies
onto self as nops☺).
Not signaling the error until the destination is read some time
later prevents software from assuming the copy was correct at the
time of copying, but allows a copy to commit once all permissions
have been verified.
The real question, here, is: if you have corrupted data in memory
and overwrite it, in its entirety, do you have to detect the error ??
should you detect the error, or should overwriting the error make
it "go away" ???
IF an error is detected then it should be required to at least be logged.
Even corrected errors must be logged for proper RAS. Undetected errors,
such as Mitch has described, by definition can't be logged.
If that corrupted data was read, the read transaction should be tagged
as poisoned, and _when consumed_[*], an error should be raised
and/or logged. If it is never consumed, it's an open question whether
logging is useful or required.
[*] If it were a speculative read, for example, that was never
consumed, should it also raise/log an error?
<snip>
By logging I mean a fifo buffer that error reports can be dumped into.
Typical reports should include the bank/ram location information
to at least narrow down to an FRU.